
TOPS: Transition-Based Volatility-Reduced Policy Search


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13441)

Abstract

Existing risk-averse reinforcement learning approaches still face several challenges, including the lack of global optimality guarantees and the necessity of learning from long-term consecutive trajectories. Long-term consecutive trajectories are prone to involving visits to hazardous states, which is a major concern in the risk-averse setting. This paper proposes Transition-based vOlatility-controlled Policy Search (TOPS), a novel algorithm that solves risk-averse problems by learning from transitions. We prove that our algorithm—under the over-parameterized neural network regime—finds a globally optimal policy at a sublinear rate with proximal policy optimization and natural policy gradient. The convergence rate is comparable to that of state-of-the-art risk-neutral policy-search methods. The algorithm is evaluated on challenging MuJoCo robot simulation tasks under the mean-variance evaluation metric. Both theoretical analysis and experimental results demonstrate that TOPS achieves state-of-the-art performance among existing risk-averse policy-search methods.


Notes

  1. For more details on double-sampling and the more general compositional expectations, please refer to [30, 52].

References

  1. Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G.: On the theory of policy gradient methods: optimality, approximation, and distribution shift. J. Mach. Learn. Res. 22(98), 1–76 (2021)


  2. Allen-Zhu, Z., Li, Y., Liang, Y.: Learning and generalization in overparameterized neural networks, going beyond two layers. In: Advances in Neural Information Processing Systems 32 (2019)


  3. Allen-Zhu, Z., Li, Y., Song, Z.: A convergence theory for deep learning via over-parameterization (2019)


  4. Antos, A., Szepesvári, C., Munos, R.: Fitted Q-iteration in continuous action-space MDPs. In: Advances in Neural Information Processing Systems 20 (2007)


  5. Arora, S., Du, S., Hu, W., Li, Z., Wang, R.: Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In: International Conference on Machine Learning, pp. 322–332. PMLR (2019)


  6. Bhandari, J., Russo, D.: Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786 (2019)

  7. Bisi, L., Sabbioni, L., Vittori, E., Papini, M., Restelli, M.: Risk-averse trust region optimization for reward-volatility reduction. In: Bessiere, C. (ed.) Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-2020, pp. 4583–4589. International Joint Conferences on Artificial Intelligence Organization, July 2020. Special Track on AI in FinTech


  8. Brockman, G., et al.: OpenAI gym. arXiv preprint arXiv:1606.01540 (2016)

  9. Cai, Q., Yang, Z., Lee, J.D., Wang, Z.: Neural temporal-difference and Q-learning provably converge to global optima. arXiv preprint arXiv:1905.10027 (2019)

  10. Cao, Y., Gu, Q.: Generalization error bounds of gradient descent for learning over-parameterized deep ReLU networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 3349–3356 (2020)


  11. Cen, S., Cheng, C., Chen, Y., Wei, Y., Chi, Y.: Fast global convergence of natural policy gradient methods with entropy regularization. Oper. Res. 70(4), 2563–2578 (2021)


  12. Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, Cambridge (2011)


  13. Dabney, W., et al.: A distributional code for value in dopamine-based reinforcement learning. Nature 577(7792), 671–675 (2020)


  14. Di Castro, D., Tamar, A., Mannor, S.: Policy gradients with variance related risk criteria. arXiv preprint arXiv:1206.6404 (2012)

  15. Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 (2018)

  16. Farahmand, A.M., Ghavamzadeh, M., Szepesvári, C., Mannor, S.: Regularized policy iteration with nonparametric function spaces. J. Mach. Learn. Res. 17(1), 4809–4874 (2016)


  17. Fu, Z., Yang, Z., Wang, Z.: Single-timescale actor-critic provably finds globally optimal policy. arXiv preprint arXiv:2008.00483 (2020)

  18. Garcıa, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16(1), 1437–1480 (2015)


  19. Gu, H., Guo, X., Wei, X., Xu, R.: Mean-field multi-agent reinforcement learning: a decentralized network approach. arXiv preprint arXiv:2108.02731 (2021)

  20. Hans, A., Schneegaß, D., Schäfer, A.M., Udluft, S.: Safe exploration for reinforcement learning. In: ESANN, pp. 143–148. Citeseer (2008)


  21. Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proceedings of the 19th International Conference on Machine Learning. Citeseer (2002)


  22. Kakade, S.M.: A natural policy gradient. In: Advances in Neural Information Processing Systems 14 (2001)


  23. Konstantopoulos, T., Zerakidze, Z., Sokhadze, G.: Radon-Nikodým theorem. In: Lovric, M. (ed.) International Encyclopedia of Statistical Science, pp. 1161–1164. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-04898-2_468


  24. Kovács, B.: Safe reinforcement learning in long-horizon partially observable environments (2020)


  25. Kubo, M., Banno, R., Manabe, H., Minoji, M.: Implicit regularization in over-parameterized neural networks. arXiv preprint arXiv:1903.01997 (2019)

  26. La, P., Ghavamzadeh, M.: Actor-critic algorithms for risk-sensitive MDPs. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 26. Curran Associates, Inc. (2013)


  27. Lai, T.L., Xing, H., Chen, Z.: Mean-variance portfolio optimization when means and covariances are unknown. Ann. Appl. Stat. 5(2A), June 2011. https://doi.org/10.1214/10-aoas422

  28. Laroche, R., Tachet des Combes, R.: Dr Jekyll and Mr Hyde: the strange case of off-policy policy updates. In: Advances in Neural Information Processing Systems 34 (2021)


  29. Li, D., Ng, W.L.: Optimal dynamic portfolio selection: multiperiod mean-variance formulation. Math. Financ. 10(3), 387–406 (2000)


  30. Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., Petrik, M.: Finite-sample analysis of proximal gradient TD algorithms. In: Proceedings of the Conference on Uncertainty in AI (UAI), pp. 504–513 (2015)


  31. Liu, B., Cai, Q., Yang, Z., Wang, Z.: Neural trust region/proximal policy optimization attains globally optimal policy. In: Advances in Neural Information Processing Systems 32 (2019)


  32. Majumdar, A., Pavone, M.: How should a robot assess risk? Towards an axiomatic theory of risk in robotics. In: Amato, N.M., Hager, G., Thomas, S., Torres-Torriti, M. (eds.) Robotics Research. SPAR, vol. 10, pp. 75–84. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-28619-4_10


  33. Mannor, S., Tsitsiklis, J.: Mean-variance optimization in Markov decision processes. arXiv preprint arXiv:1104.5601 (2011)

  34. Markowitz, H.M., Todd, G.P.: Mean-Variance Analysis in Portfolio Choice and Capital Markets, vol. 66. Wiley, New York (2000)


  35. Mei, J., Xiao, C., Szepesvari, C., Schuurmans, D.: On the global convergence rates of softmax policy gradient methods. In: International Conference on Machine Learning, pp. 6820–6829. PMLR (2020)


  36. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)


  37. Munos, R.: Performance bounds in Lp-norm for approximate value iteration. SIAM J. Control. Optim. 46(2), 541–561 (2007)


  38. Munos, R., Szepesvári, C.: Finite-time bounds for fitted value iteration. J. Mach. Learn. Res. 9(5), 815–857 (2008)


  39. Parker, D.: Managing risk in healthcare: understanding your safety culture using the Manchester patient safety framework (MaPSaF). J. Nurs. Manag. 17(2), 218–222 (2009)


  40. Rahimi, A., Recht, B.: Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In: Advances in Neural Information Processing Systems 21 (2008)


  41. Satpathi, S., Gupta, H., Liang, S., Srikant, R.: The role of regularization in overparameterized neural networks. In: 2020 59th IEEE Conference on Decision and Control (CDC), pp. 4683–4688. IEEE (2020)


  42. Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897. PMLR (2015)


  43. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  44. Shani, L., Efroni, Y., Mannor, S.: Adaptive trust region policy optimization: global convergence and faster rates for regularized MDPs. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 5668–5675 (2020)


  45. Sobel, M.J.: The variance of discounted Markov decision processes. J. Appl. Probab. 19(4), 794–802 (1982)


  46. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. A Bradford Book. MIT Press, Cambridge (2018)


  47. Sutton, R.S., et al.: Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: International Conference on Machine Learning, pp. 993–1000 (2009)


  48. Thomas, G., Luo, Y., Ma, T.: Safe reinforcement learning by imagining the near future. In: Advances in Neural Information Processing Systems 34 (2021)


  49. Todorov, E., Erez, T., Tassa, Y.: MuJoCo: a physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE (2012)


  50. Vinyals, O., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019)


  51. Wang, L., Cai, Q., Yang, Z., Wang, Z.: Neural policy gradient methods: global optimality and rates of convergence (2019)


  52. Wang, M., Fang, E.X., Liu, H.: Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Math. Program. 161(1–2), 419–449 (2017)


  53. Wang, W.Y., Li, J., He, X.: Deep reinforcement learning for NLP. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pp. 19–21 (2018)


  54. Weng, J., Duburcq, A., You, K., Chen, H.: MuJoCo benchmark (2020). https://tianshou.readthedocs.io/en/master/tutorials/benchmark.html

  55. Xie, T., et al.: A block coordinate ascent algorithm for mean-variance optimization. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018). https://proceedings.neurips.cc/paper/2018/file/4e4b5fbbbb602b6d35bea8460aa8f8e5-Paper.pdf

  56. Xu, P., Chen, J., Zou, D., Gu, Q.: Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In: Advances in Neural Information Processing Systems (2018)


  57. Xu, T., Liang, Y., Lan, G.: CRPO: a new approach for safe reinforcement learning with convergence guarantee. In: International Conference on Machine Learning, pp. 11480–11491. PMLR (2021)


  58. Yang, L., Wang, M.: Reinforcement learning in feature space: matrix bandit, kernels, and regret bound. In: International Conference on Machine Learning, pp. 10746–10756. PMLR (2020)


  59. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning (still) requires rethinking generalization. Commun. ACM 64(3), 107–115 (2021)


  60. Zhang, S., Liu, B., Whiteson, S.: Mean-variance policy iteration for risk-averse reinforcement learning. In: AAAI Conference on Artificial Intelligence (AAAI) (2021)


  61. Zhang, S., Tachet, R., Laroche, R.: Global optimality and finite sample analysis of softmax off-policy actor critic under state distribution mismatch. arXiv preprint arXiv:2111.02997 (2021)

  62. Zhong, H., Fang, E.X., Yang, Z., Wang, Z.: Risk-sensitive deep RL: variance-constrained actor-critic provably finds globally optimal policy (2020)


  63. Zou, D., Cao, Y., Zhou, D., Gu, Q.: Gradient descent optimizes over-parameterized deep ReLU networks. Mach. Learn. 109(3), 467–492 (2020)



Acknowledgment

BL’s research is funded by the National Science Foundation (NSF) under grant IIS-1910794, an Amazon Research Award, and an Adobe gift fund.

Author information


Correspondence to Bo Liu.


Appendices

A Notation Systems

  • \((\mathcal {S}, \mathcal {A}, \mathcal {P}, r,\gamma )\) with state space \(\mathcal {S}\), action space \(\mathcal {A}\), transition kernel \(\mathcal {P}\), reward function r, initial state \(S_0\) with distribution \(\mu _{0}\), and discount factor \(\gamma \).

  • \(r_{\max } > 0\) is a constant upper bound on the reward.

  • State value function \(V_{\pi }(s)\) and state-action value function \(Q_{\pi }(s,a)\).

  • The normalized state and state-action occupancy measures of policy \(\pi \) are denoted by \(\nu _\pi (s)\) and \(\sigma _\pi (s,a)\), respectively.

  • T is the length of a trajectory.

  • The return is defined as G. \(J(\pi )\) is the expectation of G.

  • Policy \(\pi _\theta \) is parameterized by the parameter \(\theta \).

  • \(\tau \) is the temperature parameter in the softmax parameterization of the policy.

  • \(F(\theta )\) is the Fisher information matrix.

  • \(\eta _{\textrm{TD}}\) is the learning rate of the TD update. Similarly, \(\eta _{\textrm{NPG}}\) is the learning rate of the NPG update, and \(\eta _{\textrm{PPO}}\) is the learning rate of the PPO update.

  • \(\beta \) is the penalty factor of the KL-divergence term in the PPO update.

  • \(f\big ((s,a);\theta \big )\) is the two-layer over-parameterized neural network, with m as its width.

  • \(\phi _\theta \) is the feature mapping of the neural network.

  • \(\mathcal {D}\) is the parameter space for \(\theta \), with \(\varUpsilon \) as its radius.

  • \(M >0\) is a constant upper bound on the initialization of \(\theta \).

  • \(J^G_\lambda (\pi )\) is the mean-variance objective function.

  • \(J_\lambda (\pi )\) is the reward-volatility objective function, with \(\lambda \) as the penalty factor.

  • \(J_\lambda ^y(\pi )\) is the transformed reward-volatility objective function, with y as the auxiliary variable.

  • \(\tilde{r}\) is the reward of the augmented MDP. Similarly, \(\tilde{V}_\pi (s)\) and \(\tilde{Q}_\pi (s,a)\) are the state value function and state-action value function of the augmented MDP, respectively. \(\tilde{J}(\pi )\) is the risk-neutral objective of the augmented MDP.

  • \(\hat{y}_{k}\) is an estimator of y at the k-th iteration.

  • \(\omega \) is the parameter of the critic network.

  • \(\delta _k=\text {argmin}_{\delta \in \mathcal {D}}\Vert \hat{F}(\theta _k)\delta -\tau _k\hat{\nabla }_\theta J(\pi _{\theta _k} )\Vert _2\).

  • \(\xi _k(\delta )=\hat{F}(\theta _k)\delta -\tau _k\hat{\nabla }_\theta \tilde{J}(\pi _{\theta _k})-\mathbb {E}[\hat{F}(\theta _k)\delta -\tau _k\hat{\nabla }_\theta \tilde{J}(\pi _{\theta _k} )]\).

  • \(\sigma _\xi \) is a constant associated with the upper bound of the gradient variance.

  • \(\varphi _k,\psi _k,\varphi '_k,\psi '_k\) are the concentrability coefficients, upper bounded by a constant \(c_0 > 0\).

  • \(\varphi ^*_{k} = \mathbb {E}_{(s,a) \sim \sigma _\pi }\bigg [\big (\frac{d\pi ^*}{d\pi _0}-\frac{d\pi _{\theta _k}}{d\pi _0}\big )^2\bigg ]^{1/2}\).

  • \(\psi ^*_{k} = \mathbb {E}_{(s,a) \sim \sigma _\pi }\bigg [\big (\frac{d\sigma _{\pi ^*}}{d\sigma _\pi }-\frac{d\nu _{\pi ^*}}{d\nu _\pi }\big )^2\bigg ]^{1/2}\).

  • K is the total number of iterations. Similarly, \(K_\textrm{TD}\) is the total number of TD iterations.

  • \(c_3>0\) is a constant that quantifies the difference in the risk-neutral objective between the optimal policy and any policy.


B Algorithm Details

We provide a comparison between MVPI and TOPS. Note that neither NPG nor PPO solves \(\theta _{k}:=\arg \max _\theta \tilde{J}(\pi _{\theta })\) directly; instead, each solves an approximate optimization problem at every iteration. We provide pseudo-code for the implementations of MVPI and VARAC in Algorithms 3 and 4.
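
For readers without access to the algorithm figures, the sketch below illustrates the per-iteration structure that Appendix D later refers to: the auxiliary-variable update (Line 5 of Algorithm 1), the TD update of the augmented critic (Line 9), and the approximate NPG/PPO policy improvement (Line 13). This is a minimal sketch under our own simplifying assumptions, not the authors' implementation; the `policy` and `critic` interfaces and the update rules shown are hypothetical placeholders, and only the augmented-reward formula follows Eq. (15).

```python
# A minimal sketch (not the authors' implementation) of one outer iteration of a
# transition-based mean-variance actor-critic method.  The three steps mirror the
# structure referenced in Appendix D: auxiliary-variable update (Line 5), TD update
# of the augmented critic (Line 9), and approximate NPG/PPO policy improvement
# (Line 13).  `policy` and `critic` are hypothetical objects with the interfaces
# used below; only the augmented reward follows Eq. (15).
import numpy as np

def risk_averse_iteration(transitions, policy, critic, lam=1.0, gamma=0.99):
    """One outer iteration on a batch of transitions (s, a, r, s')."""
    rewards = np.array([t.reward for t in transitions])

    # Line 5 (sketch): estimate the auxiliary variable y, a proxy for the
    # normalized expected per-step reward (1 - gamma) * J(pi).
    y = rewards.mean()

    # Augmented reward of Eq. (15): r_tilde = r - lam * r^2 + 2 * lam * r * y
    # (the constant -lam * y^2 does not affect the policy update).
    r_tilde = rewards - lam * rewards ** 2 + 2.0 * lam * rewards * y

    # Line 9 (sketch): TD update of the critic on the augmented reward.
    for t, rt in zip(transitions, r_tilde):
        a_next = policy.sample(t.next_state)
        td_target = rt + gamma * critic.value(t.next_state, a_next)
        critic.td_step(t.state, t.action, td_target)

    # Line 13 (sketch): approximate policy improvement (NPG- or PPO-style)
    # driven by the augmented critic rather than the risk-neutral one.
    policy.improve(transitions, critic)
    return y
```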

C Experimental Details

Note that although the mean-volatility method can be adapted to off-policy methods [60], in this paper, for ease of theoretical analysis, our proposed method is an on-policy actor-critic algorithm.

1.1 C.1 Testbeds

We use six MuJoCo tasks from OpenAI Gym [8] as testbeds: HalfCheetah-v2, Hopper-v2, Swimmer-v2, Walker2d-v2, InvertedPendulum-v2, and InvertedDoublePendulum-v2.
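
As a convenience (not part of the paper), these tasks can be instantiated through the standard Gym API, assuming a working mujoco-py installation for the -v2 versions:

```python
import gym

# The six MuJoCo testbeds listed above.
TASKS = [
    "HalfCheetah-v2", "Hopper-v2", "Swimmer-v2",
    "Walker2d-v2", "InvertedPendulum-v2", "InvertedDoublePendulum-v2",
]

# Instantiate each task and print its observation/action dimensions.
for name in TASKS:
    env = gym.make(name)
    print(name, env.observation_space.shape, env.action_space.shape)
    env.close()
```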

1.2 C.2 Hyper-parameter Settings

In the experiments, we set \(\lambda = 1\). We then tune the learning rates for the different algorithms. For MVP, we use the same settings as [60]. For MVPI, TOPS, and VARAC with neural NPG, we tune the learning rate of the actor network over \(\{0.1, 1\times 10^{-2}, 1\times 10^{-3}, 7\times 10^{-4}\}\) and the learning rate of the critic network over \(\{1\times 10^{-2}, 1\times 10^{-3}, 7\times 10^{-4}\}\). For MVPI, TOPS, and VARAC with neural PPO, we tune the learning rate of the actor network over \(\{3\times 10^{-3}, 3\times 10^{-4}, 3\times 10^{-5}\}\) and the learning rate of the critic network over \(\{1\times 10^{-2}, 1\times 10^{-3}, 1\times 10^{-4}\}\).
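
For readability, the search grids above can be restated as plain Python data (a restatement only; no settings beyond those listed in this subsection):

```python
# Learning-rate search grids, restated from the text above.
LAMBDA = 1.0

NPG_GRID = {
    "actor_lr":  [1e-1, 1e-2, 1e-3, 7e-4],   # MVPI, TOPS, VARAC with neural NPG
    "critic_lr": [1e-2, 1e-3, 7e-4],
}

PPO_GRID = {
    "actor_lr":  [3e-3, 3e-4, 3e-5],          # MVPI, TOPS, VARAC with neural PPO
    "critic_lr": [1e-2, 1e-3, 1e-4],
}
```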

Fig. 4. A block diagram of the over-parameterized neural network.

1.3 C.3 Computing Infrastructure

We conducted our experiments on GTX 970 and GTX 1080 Ti GPUs.

D Theoretical Analysis Details

In this section, we discuss the theoretical analysis in detail. We first present an overview in Sect. D.1 and then provide additional assumptions in Sect. D.2. In the rest of the section, we present all the supporting lemmas and the proofs of Theorems 1 and 2.

Fig. 5. A flow chart of the theoretical analysis.

1.1 D.1 Overview

We provide Fig. 5 to illustrate the structure of the theoretical analysis. First, under Assumptions 3 and 4, together with Lemma 13, we obtain Lemmas 14, 15 and 16. These are the building blocks of Lemma 2, which is a shared component in the analysis of both NPG and PPO. The shared components also include Lemma 3, as well as Lemma 4, which is obtained under Assumption 5. For the PPO analysis, under Assumptions 2 and 4, we obtain Lemmas 7 and 8 from Lemmas 2 and 6. Combined with Lemmas 3, 4 and 9, these yield Theorem 1, the main result of the PPO analysis. Likewise, for the NPG analysis, we first obtain Lemmas 11 and 12 under Assumptions 1, 2 and 4. Together with Lemmas 2, 3, 4 and 10, these yield Theorem 2, the main result of the NPG analysis.

1.2 D.2 Additional Assumptions

Assumption 3

(Action-value function class). We define

$$\begin{aligned}&\mathcal {F}_{\varUpsilon ,\infty } := \Bigg \{f(s,a)=f_0(s,a) \\&+ \int \mathbbm {1}\{w^\top (s,a)>0\} (s,a)^\top \iota (w)\,d\mu (w):\Vert \iota (w)\Vert _\infty \le \varUpsilon /\sqrt{d}\Bigg \} \end{aligned}$$

where \(\mu :\mathbb {R}^d \rightarrow [0,1] \) is the probability density function of \(\mathcal {N}(0,I_d/d)\), \(f_0(s,a)\) is the two-layer neural network corresponding to the initial parameter \(\varTheta _{\textrm{init}}\), and \(\iota :\mathbb {R}^d \rightarrow \mathbb {R}^d \) is a weight function. We assume that \(\tilde{Q}_\pi \in \mathcal {F}_{\varUpsilon ,\infty }\) for all \(\pi \).

Assumption 4

(Regularity of stationary distribution). For any policy \(\pi \), any \(x \in \mathbb {R}^d\) with \(\Vert x\Vert _2=1\), and any \(u>0\), we assume that there exists a constant \(c > 0\) such that \( \mathbb {E}_{(s,a) \sim \sigma _\pi }\big [\mathbbm {1}\{|x^\top (s,a)|\le u\}\big ]\le c u. \)

Assumption 3 is a mild regularity condition on \(\tilde{Q}_\pi \), as \(\mathcal {F}_{\varUpsilon ,\infty }\) is a sufficiently rich function class that approximates a subset of the reproducing kernel Hilbert space (RKHS) [40]. Similar assumptions are widely imposed [4, 16, 38, 51, 58]. Assumption 4 is a regularity condition on the transition kernel \(\mathcal {P}\). Such regularity holds as long as \(\sigma _\pi \) has a density that is bounded above, which is satisfied by most Markov chains.

Lemma 4.15 of [62] contains a mistake in its proof: a sign in \(y^*-\bar{y}\) is accidentally flipped when transitioning from the first equation of the proof to Eq. (4.15). This invalidates the conclusion in Eq. (4.17), an essential part of the proof. We address this issue with the following assumption.

Assumption 5

(Convergence Rate of \(J(\pi )\)). Let \(\pi ^*\) be the optimal policy for the risk-averse objective function \(J_\lambda (\pi )\). We assume that, for both NPG and PPO with the over-parameterized neural network, the risk-neutral objective \(J(\pi _k)\) converges to \(J(\pi ^*)\) at rate \(\mathcal {O}(1/\sqrt{k})\). Specifically, there exists a constant \(c_3>0\) such that,

$$\begin{aligned} J(\pi ^*) - J(\pi _k) \le \frac{c_3}{\sqrt{k}} \end{aligned}$$

It was proved in [31, 51] that the policy obtained by the NPG and PPO methods with an over-parameterized two-layer neural network converges to the globally optimal policy w.r.t. the risk-neutral objective \(J(\pi )\) at a rate of \(\mathcal {O}(1/\sqrt{K})\), where K is the number of iterations. Since our method uses similar settings, we assume that the convergence rate of the risk-neutral objective \(J(\pi )\) in our paper follows their results.

In the following subsections, we study the global convergence of TOPS and provide a proof sketch.

1.3 D.3 Proof of Theorem 1

We first present the analysis of the policy evaluation error, which is induced by the TD update in Line 9 of Algorithm 1. We characterize the policy evaluation error in the following lemma:

Lemma 2

(Policy Evaluation Error). We set the TD learning rate to \(\eta _{\text {TD}} = \min \{(1-\gamma )/\big (3(1+\gamma )^2\big ), 1/\sqrt{K_{\textrm{TD}}}\}\). Under Assumptions 3 and 4, it holds with probability \(1-\delta \) that,

$$\begin{aligned}&\Vert \tilde{Q}_{\omega _k}-\tilde{Q}_{\pi _k}\Vert ^2_{\nu _{\pi _k}} \nonumber \\&= \mathcal {O}(\varUpsilon ^{3}m^{-1/2}\log (1/\delta )+\varUpsilon ^{5/2}m^{-1/4}\sqrt{\log (1/\delta )}\nonumber \\&+\varUpsilon r_{\max }^2m^{-1/4}+\varUpsilon ^2K_{\textrm{TD}}^{-1/2}+\varUpsilon ), \end{aligned}$$
(14)

where \(\tilde{Q}_{\pi _k}\) is the Q-value function of the augmented MDP, and \(\tilde{Q}_{\omega _k}\) is its estimator at the k-th iteration. We provide the proof and its supporting lemmas in Appendix D.6. In the following, we establish the error induced by the policy update. Equation (8) can be re-expressed as

$$\begin{aligned} J_\lambda ^y(\pi )&= \sum _{s, a} \sigma _\pi \big (r_{s,a} -\lambda r_{s,a}^2 + 2\lambda r_{s,a}{y_{k + 1}}\big ) - \lambda y_{k+1}^2 \end{aligned}$$
(15)
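
As a consistency check (not part of the original text), maximizing Eq. (15) over the auxiliary variable (writing y for the generic auxiliary variable) gives its optimizer in closed form; this is the identification \(y^* = (1-\gamma )J(\pi )\) that the proof of Lemma 4 below relies on:

$$\begin{aligned} \frac{\partial J_\lambda ^y(\pi )}{\partial y} = 2\lambda \sum _{s, a} \sigma _\pi r_{s,a} - 2\lambda y = 0 \quad \Longrightarrow \quad y^* = \sum _{s, a} \sigma _\pi r_{s,a} = (1-\gamma )J(\pi ). \end{aligned}$$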

It can be shown that \(\max _{y}J_\lambda ^y (\pi ) = J_\lambda (\pi ) \) for all \(\pi \) [55, 60]. We denote the optimal policy of the augmented MDP associated with \(y^*\) by \(\pi ^*(y^*)\). By definition, \(\pi ^*\) and \(\pi ^*(y^*)\) are equivalent; for simplicity, we use the unified term \(\pi ^*\) in the rest of the paper. We now present Lemmas 3 and 4.

Lemma 3

(Policy’s Performance Difference). Consider the mean-volatility objective w.r.t. the auxiliary variable y, \(J^y_\lambda (\pi )\), defined in Eq. (15). For any policies \(\pi \) and \(\pi '\), we have

$$\begin{aligned} J^y_\lambda (\pi ') - J^y_\lambda (\pi )&= (1-\gamma )^{-1}\mathbb {E}_{s \sim \nu _{\pi '}}\big [\mathbb {E}_{a \sim \pi '}[\tilde{Q}_{\pi ,y} ]\\&-\mathbb {E}_{a \sim \pi }[\tilde{Q}_{\pi ,y} ]\big ], \end{aligned}$$

where \(\tilde{Q}_{\pi ,y} \) is the state-action value function of the augmented MDP, and its rewards are associated with y.

Proof

When y is fixed,

$$\begin{aligned}&J^y_\lambda (\pi ') - J^y_\lambda (\pi ) \nonumber \\&= \sum _{s, a} \sigma _{\pi '}\tilde{r}_{s,a} - \sum _{s, a} \sigma _\pi \tilde{r}_{s,a} = \tilde{J} (\pi ') - \tilde{J} (\pi ) \end{aligned}$$
(16)

We then follow Lemma 6.1 in [21]:

$$\begin{aligned} \tilde{J} (\pi ') - \tilde{J} (\pi ) = (1-\gamma )^{-1}\mathbb {E}_{(s,a) \sim \sigma _{\pi '}}\left[ \tilde{A}_\pi \right] \end{aligned}$$
(17)

where \(\tilde{A}_\pi = \tilde{Q}_\pi - \tilde{V}_\pi \) is the advantage function of policy \(\pi \). Meanwhile,

$$\begin{aligned} \mathbb {E}_{a \sim \pi '}[\tilde{A}_\pi ]&= \mathbb {E}_{a \sim \pi '}[\tilde{Q}_\pi ] - \tilde{V}_\pi = \mathbb {E}_{a \sim \pi '}[\tilde{Q}_\pi ] - \mathbb {E}_{a \sim \pi }[\tilde{Q}_\pi ] \end{aligned}$$
(18)

From Eq. (16), Eq. (17) and Eq. (18), we complete the proof.

Lemma 3 is inspired by [21] and adopted by most work on global convergence [1, 31, 57]. Next, we derive an upper bound for the error of the critic update in Line 5 of Algorithm 1:

Lemma 4

(y Update Error). We characterize the error induced by the estimation of the auxiliary variable y w.r.t. the optimal value \(y^*\) at the k-th iteration as \( J^{y^*}_\lambda (\pi ^*)-J^{\hat{y}_k}_\lambda (\pi ^*) \le \frac{2c_3 r_{\max }(1-\gamma )\lambda }{\sqrt{k}}, \) where \(r_{\max }\) is the bound on the original reward and \(c_3\) is the constant from Assumption 5.

Proof

We start from the subproblem objective defined in Eq. (15) with \(y^*\) and \(\hat{y}_k\):

$$\begin{aligned}&J^{y^*}_\lambda (\pi ^*)-J^{\hat{y}_k}_\lambda (\pi ^*) \\&= \bigg (\sum _{s, a} \sigma _{\pi ^*} \big (r_{s,a}-\lambda r^2_{s,a}+ 2\lambda r_{s,a}{y^*}\big ) - \lambda y^*{}^2 \bigg ) \\&- \bigg (\sum _{s, a} \sigma _{\pi ^*} \big (r_{s,a}-\lambda r^2_{s,a}+ 2\lambda r_{s,a}{\hat{y}_k}\big ) - \lambda \hat{y}_k^2\bigg ) \\&= 2\lambda \big (\sum _{s,a}\sigma _{\pi ^*}r_{s,a}\big )(y^*-\hat{y}_k) - \lambda (y^*{}^2-\hat{y}_k^2) \\&= \lambda \langle y^*-\hat{y}_k, 2(1-\gamma )J(\pi ^*)-y^*-\hat{y}_k\rangle \\&= (1-\gamma )\lambda \langle y^*-\hat{y}_k, J(\pi ^*)-\hat{J}(\pi _k)\rangle \end{aligned}$$

where we obtain the final two equalities from the definitions of \(J(\pi )\) and y. Because \(r_{s,a}\) is upper-bounded by the constant \(r_{\max }\), we have \(|y^*-\hat{y}_k | \le 2r_{\max }\). Under Assumption 5, we then have,

$$\begin{aligned} J^{y^*}_\lambda (\pi ^*)-J^{\hat{y}_k}_\lambda (\pi ^*) \le \frac{2 c_3 r_{\max }(1-\gamma )\lambda }{\sqrt{k}} \end{aligned}$$

Thus we finish the proof.

From Lemmas 3 and 4, we can also obtain the following lemma.

Lemma 5

(Performance Difference on \(\pi \) and y). Consider the mean-volatility objective w.r.t. the auxiliary variable y, \(J^y_\lambda (\pi )\), defined in Eq. (15). For any \(\pi ,y\) and the optimal \(\pi ^*,y^*\), we have

$$\begin{aligned} J^{y^*}_\lambda (\pi ^*) - J^y_\lambda (\pi )&\le (1-\gamma )^{-1}\mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\mathbb {E}_{a \sim \pi ^*}[\tilde{Q}_{\pi ,y} ]\\&-\mathbb {E}_{a \sim \pi }[\tilde{Q}_{\pi ,y} ]\big ] + \frac{2 c_3 r_{\max }(1-\gamma )\lambda }{\sqrt{k}}. \end{aligned}$$

where \(\tilde{Q}_{\pi ,y} \) is the state-action value function of the augmented MDP, and its rewards are associated with y.

Proof

It is easy to see that \(J^{y^*}_\lambda (\pi ^*) - J^y_\lambda (\pi ) = J^{y^*}_\lambda (\pi ^*) - J^y_\lambda (\pi ^*) + J^y_\lambda (\pi ^*) - J^{y}_\lambda (\pi )\). Then, bounding \(J^{y^*}_\lambda (\pi ^*) - J^y_\lambda (\pi ^*)\) by Lemma 4 and rewriting \(J^y_\lambda (\pi ^*) - J^{y}_\lambda (\pi )\) by Lemma 3 finishes the proof.

Lemma 5 quantifies the performance difference of \(J^{y}_\lambda (\pi )\) between any pair \(\pi ,y\) and the optimal \(\pi ^*,y^*\), while Lemma 3 only quantifies the performance difference of \(J^{y}_\lambda (\pi )\) between \(\pi \) and \(\pi '\) when y is fixed.

We now study the global convergence of TOPS with neural PPO as the policy update component. First, we define the neural PPO update rule.

Lemma 6

[31]. Let \(\pi _{\theta _k} \propto \exp \{\tau ^{-1}_k f_{\theta _k}\}\) be an energy-based policy. We define the update

$$\hat{\pi }_{k+1} = \arg {\max _\pi }\mathbb {E}_{s\sim \nu _k}[\mathbb {E}_{\pi }[Q_{\omega _k}] - \beta _k \text {KL}(\pi \Vert \pi _{\theta _k})],$$

where \(Q_{\omega _k}\) is the estimator of the exact action-value function \(Q^{\pi _{\theta _k}}\). We have

$$\begin{aligned} \hat{\pi }_{k+1} \propto \exp \{\beta ^{-1}_k Q_{\omega _k} + \tau ^{-1}_k f_{\theta _k}\} \end{aligned}$$

And to represent \(\hat{\pi }_{k+1}\) with \(\pi _{\theta _{k+1}} \propto \exp \{\tau ^{-1}_{k+1} f_{\theta _{k+1}}\}\), we solve the following subproblem,

$$\begin{aligned} \theta _{k+1}&= \arg {\min _{\theta \in \mathcal {D}}}\mathbb {E}_{(s,a)\sim \sigma _k}[(f_\theta (s,a)- \tau _{k+1}(\beta ^{-1}_k Q_{\omega _k}(s,a)\nonumber \\&+ \tau ^{-1}_k f_{\theta _k}(s,a)))^2] \end{aligned}$$

We analyze the policy improvement error in Line 13 of Algorithm 1. [31] proves that the policy improvement error can be characterized similarly to the policy evaluation error in Eq. (14). Recall that \(\tilde{Q}_{\omega _k}\) is the estimator of the Q-value, \(f_{\theta _k}\) is the energy function of the policy, and \(f_{\hat{\theta }}\) is its estimator. We characterize the policy improvement error as follows: under Assumptions 3 and 4, we set the PPO learning rate to \(\eta _{\textrm{PPO}}=\min \{(1-\gamma )/\big (3(1+\gamma )^2\big ), 1/\sqrt{K_{\textrm{TD}}}\}\), and with probability \(1-\delta \):

$$\begin{aligned}&\Vert f_{\hat{\theta }} - \tau _{k+1}(\beta ^{-1} \tilde{Q}_{\omega _k} +\tau ^{-1}_k f_{\theta _k})\Vert ^2 \nonumber \\&= \mathcal {O}(\varUpsilon ^{3}m^{-1/2}\log (1/\delta )+\varUpsilon ^{5/2}m^{-1/4}\sqrt{\log (1/\delta )}\nonumber \\&+\varUpsilon r_{\max }^2m^{-1/4}+\varUpsilon ^2K_{\textrm{TD}}^{-1/2}+\varUpsilon ). \end{aligned}$$
(19)

We quantify how the errors propagate in neural PPO [31] in the following.

Lemma 7

[31]. (Error Propagation) We have,

$$\begin{aligned}&\big |\mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\mathbb {E}_{a \sim \pi ^*}[\log (\pi _{\theta _{k+1}}/\pi _{k+1})]-\mathbb {E}_{a \sim \pi _{\theta _k}} \nonumber \\&[\log (\pi _{\theta _{k+1}}/\pi _{k+1})]\big ]\big |\le \tau ^{-1}_{k+1}\varepsilon ''_{k}\varphi ^*_{k+1} + \beta ^{-1}\varepsilon ''_{k}\psi ^*_{k} \end{aligned}$$
(20)

\(\varepsilon ''_{k}\) is defined in Eq. (14) as well as Eq. (19), \(\varphi ^*_{k} = \mathbb {E}_{(s,a) \sim \sigma _\pi }\bigg [\big (\frac{d\pi ^*}{d\pi _0}-\frac{d\pi _{\theta _k}}{d\pi _0}\big )^2\bigg ]^{1/2}\), and \(\psi ^*_{k} = \mathbb {E}_{(s,a) \sim \sigma _\pi }\bigg [\big (\frac{d\sigma _{\pi ^*}}{d\sigma _\pi }-\frac{d\nu _{\pi ^*}}{d\nu _\pi }\big )^2\bigg ]^{1/2}\), where \(\frac{d\pi ^*}{d\pi _0},\frac{d\pi _{\theta _k}}{d\pi _0},\frac{d\sigma _{\pi ^*}}{d\sigma _\pi },\frac{d\nu _{\pi ^*}}{d\nu _\pi }\) are the Radon-Nikodym derivatives [23]. We denote the RHS of Eq. (20) by \(\varepsilon _k = \tau ^{-1}_{k+1}\varepsilon ''_{k}\varphi ^*_{k+1} + \beta ^{-1}\varepsilon ''_{k}\psi ^*_{k}\). Lemma 7 essentially quantifies the error arising from using the two-layer neural network to approximate the action-value function and the policy instead of having access to the exact ones. Please refer to [31] for the complete proofs of Lemmas 6 and 7.


We then characterize the difference between energy functions at each step [31], under the state distribution induced by the optimal policy \(\pi ^*\).

Lemma 8

[31]. (Stepwise Energy Function Difference) Under the same conditions as Lemma 7, we have

$$\begin{aligned} \mathbb {E}_{s \sim \nu _{\pi ^*}}[\Vert \tau ^{-1}_{k+1} f_{\theta _{k+1}}-\tau ^{-1}_{k} f_{\theta _{k}}\Vert ^2_\infty ] \le 2\varepsilon '_k+2\beta ^{-2}_k U, \end{aligned}$$
(21)

where \(\varepsilon '_k = |\mathcal {A}|\tau ^{-2}_{k+1}\epsilon ^2_{k+1}\)

and \(U = 2\mathbb {E}_{s \sim \nu _{\pi ^*}}[\max _{a\in \mathcal {A}}(\tilde{Q}_{\omega _{0}})^2] + 2\varUpsilon ^2\).

Proof

By the triangle inequality, we get the following,

$$\begin{aligned}&\Vert \tau ^{-1}_{k+1} f_{\theta _{k+1}}-\tau ^{-1}_{k} f_{\theta _{k}}\Vert ^2_\infty \nonumber \\ \le&\, 2\big ( \Vert \tau ^{-1}_{k+1} f_{\theta _{k+1}}-\tau ^{-1}_{k} f_{\theta _{k}}-\beta ^{-1}\tilde{Q}_{\omega _k}\Vert ^2_\infty + \Vert \beta ^{-1}\tilde{Q}_{\omega _k}\Vert ^2_\infty \big ) \end{aligned}$$
(22)

We take the expectation of both sides of Eq. (22) with respect to \(s\sim \nu _{\pi ^*}\). With the 1-Lipschitz continuity of \(\tilde{Q}_{\omega _k}\) in \(\omega \) and \(\Vert \omega _k-\varTheta _\textrm{init}\Vert _2 \le \varUpsilon \), we have,

$$\begin{aligned}&\mathbb {E}_{\nu _{\pi ^*}}\big [\Vert \tau ^{-1}_{k+1} f_{\theta _{k+1}}-\tau ^{-1}_{k} f_{\theta _{k}}\Vert ^2_\infty \big ] \\ \le&\, 2(|\mathcal {A}|\tau ^{-2}_{k+1}\epsilon ^2_{k+1} + \mathbb {E}_{s \sim \nu _{\pi ^*}}[\max _{a\in \mathcal {A}}(\tilde{Q}_{\omega _{0}})^2] + \varUpsilon ^2) \end{aligned}$$

This completes the proof.

We then derive a difference term associated with \(\pi _{k+1}\) and \(\pi _{\theta _k}\), where at the k-th iteration \(\pi _{k+1}\) is the solution for the following subproblem,

$$\begin{aligned} \pi _{k+1}=\arg {\max _\pi }\Big (\mathbb {E}_{s \sim \nu _{\pi _k}}\big [\mathbb {E}_{a \sim \pi }[\tilde{Q}_{\pi _k,\hat{y}_k}]-\beta \text {KL}(\pi \Vert \pi _{\theta _k})\big ]\Big ) \end{aligned}$$

and \(\pi _{\theta _k}\) is the policy parameterized by the two-layer over-parameterized neural network. The following lemma establishes the one-step descent of the KL-divergence in the policy space:

Lemma 9

(One-step difference of \(\pi \)). For \(\pi _{k+1}\) and \(\pi _{\theta _k}\), we have

$$\begin{aligned}&\text {KL}(\pi ^*\Vert \pi _{\theta _{k}})-\text {KL}(\pi ^*\Vert \pi _{\theta _{k+1}})\nonumber \\ \ge&\, \big (\mathbb {E}_{a \sim \pi ^*}[\log (\frac{\pi _{\theta _{k+1}}}{\pi _{k+1}})]- \mathbb {E}_{a \sim \pi _{\theta _{k}}}[\log (\frac{\pi _{\theta _{k+1}}}{\pi _{k+1}})]\big )\nonumber \\&+ \beta ^{-1}\big (\mathbb {E}_{a \sim \pi ^*}[\tilde{Q}_{\pi _k,\hat{y}_k}]-\mathbb {E}_{a \sim \pi _{\theta _{k}}}[\tilde{Q}_{\pi _k,\hat{y}_k}]\big )\nonumber \\&+ \frac{1}{2}\Vert \pi _{\theta _{k+1}} - \pi _{\theta _{k}}\Vert ^2_1 + \big (\mathbb {E}_{a \sim \pi _{\theta _{k}}}[\tau ^{-1}_{k+1} f_{\theta _{k+1}}-\tau ^{-1}_{k} f_{\theta _{k}}] \nonumber \\&-\mathbb {E}_{a \sim \pi _{\theta _{k+1}}}[\tau ^{-1}_{k+1} f_{\theta _{k+1}}-\tau ^{-1}_{k} f_{\theta _{k}}]\big ) \end{aligned}$$
(23)

Proof

We start from

$$\begin{aligned}&\text {KL}(\pi ^*\Vert \pi _{\theta _{k}})-\text {KL}(\pi ^*\Vert \pi _{\theta _{k+1}}) = \mathbb {E}_{a \sim \pi ^*}\Big [\log \Big (\frac{\pi _{\theta _{k+1}}}{\pi _{\theta _{k}}}\Big )\Big ] \nonumber \\&\Big (\text {by definition, } \text {KL}(\pi _{\theta _{k+1}}\Vert \pi _{\theta _{k}})= \mathbb {E}_{a \sim \pi _{\theta _{k+1}}}\Big [\log \Big (\frac{\pi _{\theta _{k+1}}}{\pi _{\theta _{k}}}\Big )\Big ]\Big ) \nonumber \\&= \Big (\mathbb {E}_{a \sim \pi ^*}\Big [\log \Big (\frac{\pi _{\theta _{k+1}}}{\pi _{\theta _{k}}}\Big )\Big ]-\mathbb {E}_{a \sim \pi _{\theta _{k+1}}}\Big [\log \Big (\frac{\pi _{\theta _{k+1}}}{\pi _{\theta _{k}}}\Big )\Big ]\Big ) + \text {KL}(\pi _{\theta _{k+1}}\Vert \pi _{\theta _{k}}) \nonumber \\&\text {Adding and subtracting terms, we get}\nonumber \\&= \mathbb {E}_{a \sim \pi ^*}\Big [\log \Big (\frac{\pi _{\theta _{k+1}}}{\pi _{\theta _{k}}}\Big )\Big ]-\mathbb {E}_{a \sim \pi _{\theta _{k+1}}}\Big [\log \Big (\frac{\pi _{\theta _{k+1}}}{\pi _{\theta _{k}}}\Big )\Big ] + \text {KL}(\pi _{\theta _{k+1}}\Vert \pi _{\theta _{k}})\nonumber \\&\quad + \beta ^{-1}\big (\mathbb {E}_{a \sim \pi ^*}[\tilde{Q}_{\pi _k,\hat{y}_k}]-\mathbb {E}_{a \sim \pi _{\theta _{k}}}[\tilde{Q}_{\pi _k,\hat{y}_k}]\big ) - \beta ^{-1}\big (\mathbb {E}_{a \sim \pi ^*}[\tilde{Q}_{\pi _k,\hat{y}_k}]-\mathbb {E}_{a \sim \pi _{\theta _{k}}}[\tilde{Q}_{\pi _k,\hat{y}_k}]\big ) \nonumber \\&\quad + \mathbb {E}_{a \sim \pi _{\theta _{k}}}\Big [\log \Big (\frac{\pi _{\theta _{k+1}}}{\pi _{\theta _{k}}}\Big )\Big ]-\mathbb {E}_{a \sim \pi _{\theta _{k}}}\Big [\log \Big (\frac{\pi _{\theta _{k+1}}}{\pi _{\theta _{k}}}\Big )\Big ] \nonumber \\&\text {Rearranging the terms, we get}\nonumber \\&= \big (\mathbb {E}_{a \sim \pi ^*}[\log (\pi _{\theta _{k+1}})-\log (\pi _{\theta _{k}})-\beta ^{-1} \tilde{Q}_{\pi _k,\hat{y}_k}]- \mathbb {E}_{a \sim \pi _{\theta _{k}}}[\log (\pi _{\theta _{k+1}})-\log (\pi _{\theta _{k}})-\beta ^{-1} \tilde{Q}_{\pi _k,\hat{y}_k}]\big )\nonumber \\&\quad + \beta ^{-1}\big (\mathbb {E}_{a \sim \pi ^*}[\tilde{Q}_{\pi _k,\hat{y}_k}]-\mathbb {E}_{a \sim \pi _{\theta _{k}}}[\tilde{Q}_{\pi _k,\hat{y}_k}]\big )+ \text {KL}(\pi _{\theta _{k+1}}\Vert \pi _{\theta _{k}}) \nonumber \\&\quad + \Big (\mathbb {E}_{a \sim \pi _{\theta _{k}}}\Big [\log \Big (\frac{\pi _{\theta _{k+1}}}{\pi _{\theta _{k}}}\Big )\Big ]-\mathbb {E}_{a \sim \pi _{\theta _{k+1}}}\Big [\log \Big (\frac{\pi _{\theta _{k+1}}}{\pi _{\theta _{k}}}\Big )\Big ]\Big ) \end{aligned}$$
(24)

Recall that \(\pi _{k+1} \propto \exp \{\tau ^{-1}_k f_{\theta _k}+\beta ^{-1} \tilde{Q}^y_{\pi _k}\}\). We define the two normalization factors associated with ideal improved policy \(\pi _{k+1}\) and the current parameterized policy \(\pi _{\theta _k}\) as,

$$\begin{aligned}&Z_{k+1}(s) := \sum _{a'\in \mathcal {A}}\exp \{\tau ^{-1}_k f_{\theta _k}(s,a')+\beta ^{-1} \tilde{Q}^y_{\pi _k}(s,a')\}\\&Z_{\theta _{k+1}}(s) := \sum _{a'\in \mathcal {A}}\exp \{\tau ^{-1}_{k+1} f_{\theta _{k+1}}(s,a')\} \end{aligned}$$

We then have,

$$\begin{aligned}&\pi _{k+1}(a|s) = \frac{\exp \{\tau ^{-1}_k f_{\theta _k}(s,a)+\beta ^{-1} \tilde{Q}^y_{\pi _k}(s,a)\}}{Z_{k+1}(s)},\end{aligned}$$
(25)
$$\begin{aligned}&\pi _{\theta _{k+1}}(a|s) = \frac{\exp \{\tau ^{-1}_{k+1} f_{\theta _{k+1}}(s,a)\}}{Z_{\theta _{k+1}}(s)} \end{aligned}$$
(26)

For any \(\pi , \pi '\) and k, we have,

$$\begin{aligned} \mathbb {E}_{a \sim \pi }[\log Z_{\theta _{k+1}}]-\mathbb {E}_{a \sim \pi '}[\log Z_{\theta _{k+1}}]= 0\end{aligned}$$
(27)
$$\begin{aligned} \mathbb {E}_{a \sim \pi }[\log Z_{k+1}]-\mathbb {E}_{a \sim \pi '}[\log Z_{k+1}] = 0 \end{aligned}$$
(28)

Now we revisit a few terms on the RHS of Eq. (24):

$$\begin{aligned}&\mathbb {E}_{a \sim \pi ^*}\big [\log (\pi _{\theta _{k}})+\beta ^{-1} \tilde{Q}_{\pi _k,\hat{y}_k}\big ]\nonumber \\&- \mathbb {E}_{a \sim \pi _{\theta _{k}}}\big [\log (\pi _{\theta _{k}})+\beta ^{-1} \tilde{Q}_{\pi _k,\hat{y}_k}\big ]\nonumber \\ =&\,\big (\mathbb {E}_{a \sim \pi ^*}[\tau ^{-1}_k f_{\theta _k}+\beta ^{-1} \tilde{Q}_{\pi _k,\hat{y}_k}-\log Z_{\theta _{k+1}}]\nonumber \\&-\mathbb {E}_{a \sim \pi _{\theta _{k}}}[\tau ^{-1}_k f_{\theta _k}+\beta ^{-1} \tilde{Q}_{\pi _k,\hat{y}_k}-\log Z_{\theta _{k+1}}]\big ) \nonumber \\ =&\,\mathbb {E}_{a \sim \pi ^*}\Big [\log \frac{\exp \{\tau ^{-1}_k f_{\theta _k}+\beta ^{-1} \tilde{Q}_{\pi _k,\hat{y}_k}\}}{Z_{k+1}}\Big ]\nonumber \\&-\mathbb {E}_{a \sim \pi _{\theta _{k}}}\Big [\log \frac{\exp \{\tau ^{-1}_k f_{\theta _k}+\beta ^{-1} \tilde{Q}_{\pi _k,\hat{y}_k}\}}{Z_{k+1}}\Big ]\nonumber \\ =&\,\mathbb {E}_{a \sim \pi ^*}[\log \pi _{k+1}]- \mathbb {E}_{a \sim \pi _{\theta _{k}}}[\log \pi _{k+1}] \end{aligned}$$
(29)

For Eq. (29), the first equality follows from Eq. (26). The second equality follows by exchanging the normalization factors via Eqs. (27) and (28). We reach the concluding step with the definition in Eq. (25). Following a similar logic, we have,

$$\begin{aligned}&\mathbb {E}_{a \sim \pi _{\theta _{k}}}\Big [\log \Big (\frac{\pi _{\theta _{k+1}}}{\pi _{\theta _{k}}}\Big )\Big ]-\mathbb {E}_{a \sim \pi _{\theta _{k+1}}}\Big [\log \Big (\frac{\pi _{\theta _{k+1}}}{\pi _{\theta _{k}}}\Big )\Big ]\nonumber \\ =&\, \mathbb {E}_{a \sim \pi _{\theta _{k}}}[\tau ^{-1}_{k+1} f_{\theta _{k+1}}-\log Z_{\theta _{k+1}}-\tau ^{-1}_{k} f_{\theta _{k}}+\log Z_{\theta _{k}}]-\mathbb {E}_{a \sim \pi _{\theta _{k+1}}}[\tau ^{-1}_{k+1} f_{\theta _{k+1}} - \log Z_{\theta _{k+1}} - \tau ^{-1}_{k} f_{\theta _{k}}+\log Z_{\theta _{k}}] \nonumber \\ =&\,\mathbb {E}_{a \sim \pi _{\theta _{k}}}[\tau ^{-1}_{k+1} f_{\theta _{k+1}} - \tau ^{-1}_{k} f_{\theta _{k}}] -\mathbb {E}_{a \sim \pi _{\theta _{k+1}}}[\tau ^{-1}_{k+1} f_{\theta _{k+1}} - \tau ^{-1}_{k} f_{\theta _{k}}] \end{aligned}$$
(30)

Finally, by Pinsker's inequality [12], we have,

$$\begin{aligned} \text {KL}(\pi _{\theta _{k+1}}\Vert \pi _{\theta _{k}}) \ge 1/2 \Vert \pi _{\theta _{k+1}} - \pi _{\theta _{k}}\Vert ^2_1 \end{aligned}$$
(31)

Plugging Eqs. (29), (30), and (31) into Eq. (24), we have

$$\begin{aligned}&\text {KL}(\pi ^*\Vert \pi _{\theta _{k}})-\text {KL}(\pi ^*\Vert \pi _{\theta _{k+1}})\\ \ge&\, \big (\mathbb {E}_{a \sim \pi ^*}[\log (\pi _{\theta _{k+1}})-\log (\pi _{k+1})]- \mathbb {E}_{a \sim \pi _{\theta _{k}}}[\log (\pi _{\theta _{k+1}}) \\&-\log (\pi _{k+1})]\big )+ \beta ^{-1}\big (\mathbb {E}_{a \sim \pi ^*}[\tilde{Q}_{\pi _k,\hat{y}_k}]-\mathbb {E}_{a \sim \pi _{\theta _{k}}}[\tilde{Q}_{\pi _k,\hat{y}_k}]\big ) \\&+ \frac{1}{2}\Vert \pi _{\theta _{k+1}} - \pi _{\theta _{k}}\Vert ^2_1 + \big (\mathbb {E}_{a \sim \pi _{\theta _{k}}}[\tau ^{-1}_{k+1} f_{\theta _{k+1}}-\tau ^{-1}_{k} f_{\theta _{k}}] \\&- \mathbb {E}_{a \sim \pi _{\theta _{k+1}}}[\tau ^{-1}_{k+1} f_{\theta _{k+1}}-\tau ^{-1}_{k} f_{\theta _{k}}]\big ) \end{aligned}$$

Rearranging the terms, we obtain Lemma 9.

Lemma 9 serves as an intermediate result for the proof of the main theorem; we obtain upper bounds by telescoping this term in Theorem 1. We are now ready to present the proof of Theorem 1.

Proof

First, we take the expectation of both sides of Eq. (23) from Lemma 9 with respect to \(s\sim \nu _{\pi ^*}\) and insert Eq. (20) to obtain,

$$\begin{aligned}&\mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{\theta _{k+1}})] - \mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{\theta _{k}})] \nonumber \\ \le&\, \varepsilon _k - \beta ^{-1}\mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\mathbb {E}_{a \sim \pi ^*}[\tilde{Q}_{\pi _k,\hat{y}_k}]-\mathbb {E}_{a \sim \pi _{\theta _{k}}}[\tilde{Q}_{\pi _k,\hat{y}_k}]\big ] \nonumber \\&- 1/2\mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\Vert \pi _{\theta _{k+1}} - \pi _{\theta _{k}}\Vert ^2_1\big ] - \mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\mathbb {E}_{a \sim \pi _{\theta _{k}}}\\&[\tau ^{-1}_{k+1} f_{\theta _{k+1}}- \tau ^{-1}_{k} f_{\theta _{k}}]- \mathbb {E}_{a \sim \pi _{\theta _{k+1}}}[\tau ^{-1}_{k+1} f_{\theta _{k+1}} -\tau ^{-1}_{k} f_{\theta _{k}}]\big ] \nonumber \end{aligned}$$
(32)

Then, by Lemma 3, we have,

$$\begin{aligned}&\beta ^{-1}\mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\mathbb {E}_{a \sim \pi ^*}[\tilde{Q}_{\pi _k,\hat{y}_k}]-\mathbb {E}_{a \sim \pi _{\theta _{k}}}[\tilde{Q}_{\pi _k,\hat{y}_k}]\big ] \nonumber \\&=\beta ^{-1}(1-\gamma )\big (J^{\hat{y}_k}_\lambda (\pi ^*) - J^{\hat{y}_k}_\lambda (\pi )\big ) \end{aligned}$$
(33)

And with Hölder’s inequality, we have,

$$\begin{aligned}&\mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\mathbb {E}_{a \sim \pi _{\theta _{k}}}[\tau ^{-1}_{k+1} f_{\theta _{k+1}}- \tau ^{-1}_{k} f_{\theta _{k}}]- \mathbb {E}_{a \sim \pi _{\theta _{k+1}}} \nonumber \\&[\tau ^{-1}_{k+1} f_{\theta _{k+1}}-\tau ^{-1}_{k} f_{\theta _{k}}]\big ] \nonumber \\ =&\,\mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\big \langle \tau ^{-1}_{k+1} f_{\theta _{k+1}}- \tau ^{-1}_{k} f_{\theta _{k}}, \pi _{\theta _{k}}-\pi _{\theta _{k+1}} \big \rangle \big ]\\ \le&\, \mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\Vert \tau ^{-1}_{k+1} f_{\theta _{k+1}}- \tau ^{-1}_{k} f_{\theta _{k}}\Vert _{\infty } \Vert \pi _{\theta _{k}}-\pi _{\theta _{k+1}}\Vert _1\big ] \nonumber \end{aligned}$$
(34)

Inserting Eqs. (33) and (34) into Eq. (32), we have,

$$\begin{aligned}&\mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{\theta _{k+1}})] - \mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{\theta _{k}})] \\ \le&\, \varepsilon _k - (1-\gamma )\beta ^{-1}\big (J^{\hat{y}_k}_\lambda (\pi ^*) - J^{\hat{y}_k}_\lambda (\pi )\big ) - 1/2\mathbb {E}_{s \sim \nu _{\pi ^*}} \\&\big [\Vert \pi _{\theta _{k+1}} - \pi _{\theta _{k}}\Vert ^2_1\big ] + \mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\Vert \tau ^{-1}_{k+1} f_{\theta _{k+1}}- \tau ^{-1}_{k} f_{\theta _{k}}\Vert _{\infty } \\&\Vert \pi _{\theta _{k}}-\pi _{\theta _{k+1}}\Vert _1\big ] \\ \le&\, \varepsilon _k - (1-\gamma )\beta ^{-1}\big (J^{y^*}_\lambda (\pi ^*) - J^{\hat{y}_k}_\lambda (\pi ) - J^{y^*}_\lambda (\pi ^*) \\&+ J^{\hat{y}_k}_\lambda (\pi ^*) \big )+ 1/2\mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\Vert \tau ^{-1}_{k+1} f_{\theta _{k+1}}- \tau ^{-1}_{k} f_{\theta _{k}}\Vert ^2_{\infty }\big ]\\ \le&\, \varepsilon _k - (1-\gamma )\beta ^{-1}\big (J^{y^*}_\lambda (\pi ^*) - J^{\hat{y}_k}_\lambda (\pi )\big ) \\&+ (1-\gamma )\beta ^{-1}\big (J^{y^*}_\lambda (\pi ^*) - J^{\hat{y}_k}_\lambda (\pi ^*)\big ) \\&+ 1/2\mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\Vert \tau ^{-1}_{k+1} f_{\theta _{k+1}}- \tau ^{-1}_{k} f_{\theta _{k}}\Vert ^2_{\infty }\big ]. \end{aligned}$$

The second inequality holds by the elementary inequality \(2AB - B^2\le A^2\), with a minor abuse of notation: here \(A := \Vert \tau ^{-1}_{k+1} f_{\theta _{k+1}}- \tau ^{-1}_{k} f_{\theta _{k}}\Vert _{\infty }\) and \(B := \Vert \pi _{\theta _{k}}-\pi _{\theta _{k+1}}\Vert _1\). Then, by plugging in Lemma 4 and Eq. (21), we end up with,

$$\begin{aligned}&\mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{\theta _{k+1}})] - \mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{\theta _{k}})] \nonumber \\ \le&\, \varepsilon _k - (1-\gamma )\beta ^{-1}\big (J^{y^*}_\lambda (\pi ^*) - J^{\hat{y}_k}_\lambda (\pi _k)\big ) \\&+(1-\gamma )\beta ^{-1}\big (\frac{2c_3r_{\max }(1-\gamma )\lambda }{\sqrt{k}}\big ) + (\varepsilon '_k+\beta ^{-2}_k U) \nonumber \end{aligned}$$
(35)

Rearranging Eq. (35), we have

$$\begin{aligned}&(1-\gamma )\beta ^{-1}\big (J^{y^*}_\lambda (\pi ^*) - J^{\hat{y}_k}_\lambda (\pi _k)\big )\nonumber \\ \le&\,\mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{\theta _{k}})]-\mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{\theta _{k+1}})] \nonumber \\&+\big (\frac{2c_3r_{\max }(1-\gamma )^2\lambda }{\beta \sqrt{k}}\big ) +\varepsilon _k+\varepsilon '_k+\beta ^{-2}_k U \end{aligned}$$
(36)

And then telescoping Eq. (36) results in,

$$\begin{aligned}&(1-\gamma )\sum ^{K}_{k=1}\beta ^{-1}\min _{k\in [K]}\big (J^{y^*}_{\lambda }(\pi ^*)-J^{\hat{y}_k}_{\lambda }(\pi _k)\big )\nonumber \\ \le&\, (1-\gamma )\sum ^{K}_{k=1}\beta ^{-1}\big (J^{y^*}_{\lambda }(\pi ^*)-J^{\hat{y}_k}_{\lambda }(\pi _k)\big )\nonumber \\ \le&\, \mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{0})]-\mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{K})] \nonumber \\&+ \lambda r_{\max } (1-\gamma )^2\sum ^{K}_{k=1}\beta ^{-1}\big (\frac{2c_3}{\sqrt{k}}\big ) + U \sum ^{K}_{k=1}\beta ^{-2}_k \nonumber \\&+ \sum ^{K}_{k=1}(\varepsilon _k+\varepsilon '_k) \end{aligned}$$
(37)

We complete the final step in Eq. (37) by plugging in Lemma 4 and Eq. (20). As in the proof of Theorem 2, we observe the following:

  1. \(\mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{0})] \le \log |\mathcal {A}|\) due to the uniform initialization of the policy.

  2. \(\text {KL}(\pi ^*\Vert \pi _{K})\) is a non-negative term.

We now have,

$$\begin{aligned}&\min _{k\in [K]}J^{y^*}_{\lambda }(\pi ^*)-J^{\hat{y}_k}_{\lambda }(\pi _k) \\ \le&\, \frac{\log |\mathcal {A}|+ UK\beta ^{-2} + \sum ^{K}_{k=1}(\varepsilon _k+\varepsilon '_k)}{(1-\gamma )K\beta ^{-1}}\\&+\lambda r_{\max } (1-\gamma )\big (\frac{2c_3}{\sqrt{k}}\big ) \end{aligned}$$

Replacing \(\beta \) with \(\beta _0\sqrt{K}\) finishes the proof.

1.4 D.4 Proof of Theorem 2

In this part, we focus on the convergence of neural NPG. We first define the following terms under the neural NPG update rule.

Lemma 10

[51]. For energy-based policy \(\pi _\theta \), we have policy gradient and Fisher information matrix,

$$\begin{aligned} \nabla _{\theta } J(\pi _{\theta })&= \tau \mathbb {E}_{d_{\pi _\theta }(s,a)}[Q_{\pi _\theta }(s,a) (\phi _\theta (s,a)- \mathbb {E}_{\pi _\theta }[\phi _\theta (s,a')])] \\ F(\theta )&= \tau ^2 \mathbb {E}_{d_{\pi _\theta }(s,a)}[(\phi _\theta (s,a) - \mathbb {E}_{\pi _\theta }[\phi _\theta (s,a')])\nonumber \\&(\phi _\theta (s,a) - \mathbb {E}_{\pi _\theta }[\phi _\theta (s,a')])^\top ] \end{aligned}$$

We then derive an upper bound on \(J^{\hat{y}_k}_{\lambda }(\pi ^*)-J^{\hat{y}_k}_{\lambda }(\pi _k)\) for the neural NPG method in the following lemma:

Lemma 11

(One-step difference of \(\pi \)). It holds that, with probability of \(1-\delta \),

$$\begin{aligned}&(1-\gamma )\big (J^{\hat{y}_k}_{\lambda }(\pi ^*)-J^{\hat{y}_k}_{\lambda }(\pi _k)\big ) \nonumber \\ \le&\,\eta _{\textrm{NPG}}^{-1}\mathbb {E}_{s \sim \nu _{\pi ^*}}\big [ \text {KL}(\pi ^*\Vert \pi _{k})-\text {KL}(\pi ^*\Vert \pi _{k+1})\big ] \nonumber \\&+\eta _{\textrm{NPG}} (9\varUpsilon ^2+r_{\max }^2)+2c_0\epsilon '_{k} + \eta _{\textrm{NPG}}^{-1}\epsilon ''_{k}, \end{aligned}$$
$$\begin{aligned} \text {where }&\\ \epsilon '_{k}&= \mathcal {O}(\varUpsilon ^{3}m^{-1/2}\log (1/\delta )+\varUpsilon ^{5/2}m^{-1/4}\sqrt{\log (1/\delta )} \\&+\varUpsilon r_{\max }^2m^{-1/4}+\varUpsilon ^2K_{\textrm{TD}}^{-1/2}+\varUpsilon ), \\ \epsilon ''_k&= 8 \eta _{\textrm{NPG}} \varUpsilon ^{1/2} c_0\sigma _\xi ^{1/2} T^{-1/4} \\&+ \mathcal {O}((\tau _{k+1}+\eta _{\textrm{NPG}}) \varUpsilon ^{3/2} m^{-1/4}\\&+ \eta _{\textrm{NPG}} \varUpsilon ^{5/4} m^{-1/8}),\\ \end{aligned}$$

\(c_0\) is defined in Assumption 2 and \(\sigma _\xi \) is defined in Assumption 1. Meanwhile, \(\varUpsilon \) is the radius of the parameter space, m is the width of the neural network, and T is the sample batch size.

Proof

We start from the following,

$$\begin{aligned}&\text {KL}(\pi ^*\Vert \pi _{k})-\text {KL}(\pi ^*\Vert \pi _{k+1})-\text {KL}(\pi _{k+1}\Vert \pi _{k}) \nonumber \\&= \mathbb {E}_{a \sim \pi ^*}\big [\log (\frac{\pi _{k+1}}{\pi _{k}})\big ] -\mathbb {E}_{a \sim \pi _{k+1}}\big [\log (\frac{\pi _{k+1}}{\pi _{k}})\big ] \\ {}&\text {(by KL's definition)}.\nonumber \end{aligned}$$
(38)

We now present the building blocks of the proof. First, we add and subtract a few terms on the RHS of Eq. (38) and then take the expectation of both sides with respect to \(s\sim \nu _{\pi ^*}\). Rearranging these terms, we get,

$$\begin{aligned}&\mathbb {E}_{s\sim \nu _{\pi ^*}}\big [\text {KL}(\pi ^*\Vert \pi _{k})-\text {KL}(\pi ^*\Vert \pi _{k+1})-\text {KL}(\pi _{k+1}\Vert \pi _{k})\big ] \nonumber \\&= \eta _{\textrm{NPG}}\mathbb {E}_{s\sim \nu _{\pi ^*}}\big [\mathbb {E}_{a \sim \pi ^*}[\tilde{Q}_{\pi _k,\hat{y}_k}]-\mathbb {E}_{a \sim \pi _{k}}[\tilde{Q}_{\pi _k,\hat{y}_k}]\big ] \nonumber \\&+ H_k \end{aligned}$$
(39)

where \(H_k\) is given by,

$$\begin{aligned} H_k&:= \mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\mathbb {E}_{a \sim \pi ^*}[\log (\frac{\pi _{k+1}}{\pi _{k}})-\eta _{\textrm{NPG}} \tilde{Q}_{\omega _k}\big ]\nonumber \\&-\mathbb {E}_{a \sim \pi _{k}}\big [\log (\frac{\pi _{k+1}}{\pi _{k}})-\eta _{\textrm{NPG}} \tilde{Q}_{\omega _k}]\big ]\nonumber \\&+\eta _{\textrm{NPG}} \mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\mathbb {E}_{a \sim \pi ^*}[\tilde{Q}_{\omega _k}-\tilde{Q}_{\pi _k, \hat{y}_k}]\nonumber \\&-\mathbb {E}_{a \sim \pi _{k}}[\tilde{Q}_{\omega _k}-\tilde{Q}_{\pi _k, \hat{y}_k}]\big ] \nonumber \\&+\mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\mathbb {E}_{a \sim \pi _{k}}[\log (\frac{\pi _{k+1}}{\pi _{k}})]\nonumber \\&-\mathbb {E}_{a \sim \pi _{k+1}}[\log (\frac{\pi _{k+1}}{\pi _{k}})]\big ] \end{aligned}$$
(40)

By Lemma 3, we have

$$\begin{aligned}&\eta _{\textrm{NPG}}\mathbb {E}_{s\sim \nu _{\pi ^*}}\big [\mathbb {E}_{a \sim \pi ^*}[\tilde{Q}_{\pi _k,\hat{y}_k}]-\mathbb {E}_{a \sim \pi _{k}}[\tilde{Q}_{\pi _k,\hat{y}_k}]\big ] \nonumber \\&= \eta _{\textrm{NPG}}(1-\gamma )\big (J^{\hat{y}_k}_\lambda (\pi ^*)-J^{\hat{y}_k}_{\lambda }(\pi _k)\big ) \end{aligned}$$
(41)

Inserting Eq. (41) back into Eq. (39), we have,

$$\begin{aligned}&\eta _{\textrm{NPG}} (1-\gamma )\big (J^{\hat{y}_k}_\lambda (\pi ^*)-J^{\hat{y}_k}_{\lambda }(\pi _k)\big ) \nonumber \\&= \mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\text {KL}(\pi ^*\Vert \pi _{k})-\text {KL}(\pi ^*\Vert \pi _{k+1}) -\text {KL}(\pi _{k+1}\Vert \pi _{k})\big ]\nonumber \\ {}&-H_k \nonumber \\ \le&\, \mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\text {KL}(\pi ^*\Vert \pi _{k})-\text {KL}(\pi ^*\Vert \pi _{k+1}) -\text {KL}(\pi _{k+1}\Vert \pi _{k})\big ]\nonumber \\ {}&+|H_k| \end{aligned}$$
(42)

We reach the final inequality of Eq. (42) by algebraic manipulation. Second, we follow Lemma 5.5 of [51] and obtain an upper bound for Eq. (40). Specifically, with probability \(1-\delta \),

$$\begin{aligned}&\mathbb {E}_{a \sim \textrm{init}}\Big [|H_k|-\mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi _{k+1}\Vert \pi _{k})]\Big ] \nonumber \\ \le&\,\eta _{\textrm{NPG}}^2 (9\varUpsilon ^2+r_{\max }^2)+2\eta _{\textrm{NPG}} c_0 \epsilon '_{k} + \epsilon ''_{k} \end{aligned}$$
(43)

The expectation is taken over all the randomness. With the building blocks in Eqs. (42) and (43), we are now ready to reach the concluding inequality. Plugging Eq. (43) back into Eq. (42), we end up with, with probability \(1-\delta \),

$$\begin{aligned}&\eta _{\textrm{NPG}} (1-\gamma )\big (J^{\hat{y}_k}_\lambda (\pi ^*)-J^{\hat{y}_k}_{\lambda }(\pi _k)\big ) \nonumber \\ \le&\, \mathbb {E}_{s \sim \nu _{\pi ^*}}\big [\text {KL}(\pi ^*\Vert \pi _{k})-\text {KL}(\pi ^*\Vert \pi _{k+1}) \big ] \nonumber \\&+ \eta _{\textrm{NPG}}^2 (9\varUpsilon ^2+ r_{\max }^2)+2\eta _{\textrm{NPG}} c_0 \epsilon '_{k} + \epsilon ''_{k} \end{aligned}$$
(44)

Dividing both sides of Eq. (44) by \(\eta _{\textrm{NPG}}\) completes the proof. The details are included in the Appendix.

We have the following lemma to bound the error term \(H_k\) defined in Eq. (40) of Lemma 11.

Lemma 12

[51]. Under Assumption 4, we have

$$\begin{aligned}&\mathbb {E}_{a \sim \textrm{init}}\Big [|H_k|-\mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi _{k+1}\Vert \pi _{k})]\Big ] \\ \le&\,\eta _{\textrm{NPG}}^2 (9\varUpsilon ^2+r_{\max }^2)+\eta _{\textrm{NPG}}(\varphi '_k+\psi '_k) \epsilon '_{k} + \epsilon ''_{k} \\ \end{aligned}$$

Here the expectation is taken over all the randomness. We have \(\epsilon '_{k}:=\Vert Q_{\omega _k}-Q_{\pi _k}\Vert ^2_{\nu _{\pi _k}}\) and

$$\begin{aligned} \epsilon ''_{k}&= \sqrt{2}\varUpsilon ^{1/2}\eta _{\textrm{NPG}}(\varphi _k+\psi _k)\tau _k^{-1}\big \{\mathbb {E}_{(s,a) \sim \sigma _{\pi _{\theta _k}}}[\Vert \xi _k(\delta _k)\Vert _2^2 ] \\&+ \mathbb {E}_{(s,a) \sim \sigma _{\pi _{\omega _k}}}[\Vert \xi _k(\omega _k)\Vert _2^2 ]\big \}^{1/2} \\&+ \mathcal {O}((\tau _{k+1}+\eta _{\textrm{NPG}}) \varUpsilon ^{3/2} m^{-1/4} + \eta _{\textrm{NPG}} \varUpsilon ^{5/4} m^{-1/8}). \end{aligned}$$

Recall that \(\xi _k(\delta _k)\) and \(\xi _k(\omega _k)\) are defined in Assumption 1, while \(\varphi _k\), \(\psi _k\), \(\varphi '_k\), and \(\psi '_k\) are defined in Assumption 2.

Please refer to [51] for the complete proof. Finally, we are ready to present the proof of Theorem 2.

Proof

First, we combine Lemmas 4 and 11 to obtain the following:

$$\begin{aligned}&(1-\gamma )\big (J^{y^*}_\lambda (\pi ^*)-J^{\hat{y}_k}_\lambda (\pi ^*) + J^{\hat{y}_k}_{\lambda }(\pi ^*)-J^{\hat{y}_k}_{\lambda }(\pi _k)\big )\nonumber \\ \le&\,\eta _{\textrm{NPG}}^{-1}\mathbb {E}_{s \sim \nu _{\pi ^*}}\left[ \text {KL}(\pi ^*\Vert \pi _{k})-\text {KL}(\pi ^*\Vert \pi _{k+1})\right] \nonumber \\&+ \eta _{\textrm{NPG}} (9\varUpsilon ^2+r_{\max }^2)+2 c_0\epsilon '_{k} + \eta _{\textrm{NPG}}^{-1}\epsilon ''_{k} \nonumber \\&+ \frac{2c_3M(1-\gamma )^2\lambda }{\sqrt{k}} \end{aligned}$$
(45)

We then make two observations:

  1. 1.

    \(\mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{1})] \le \log |\mathcal {A}|\) due to the uniform initialization of the policy (a short derivation is sketched after this list).

  2. 2.

    \(\text {KL}(\pi ^*\Vert \pi _{K+1})\) is a non-negative term.
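For completeness, the first observation follows from a one-line calculation (a sketch, assuming the initial policy \(\pi _1(\cdot |s)\) is uniform over \(\mathcal {A}\)):

$$\begin{aligned} \text {KL}(\pi ^*(\cdot |s)\Vert \pi _{1}(\cdot |s)) = \sum _{a\in \mathcal {A}}\pi ^*(a|s)\log \big (\pi ^*(a|s)\,|\mathcal {A}|\big ) = \log |\mathcal {A}| - \mathcal {H}\big (\pi ^*(\cdot |s)\big ) \le \log |\mathcal {A}|, \end{aligned}$$

since the entropy \(\mathcal {H}(\pi ^*(\cdot |s))\) is non-negative; the bound is preserved after taking \(\mathbb {E}_{s \sim \nu _{\pi ^*}}\).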

By setting \(\eta _{\textrm{NPG}}=1/\sqrt{K}\) and telescoping Eq. (45), we obtain,

$$\begin{aligned}&(1-\gamma )\min _{k\in [K]}\big (J^{y^*}_{\lambda }(\pi ^*)-J^{\hat{y}_k}_{\lambda }(\pi _k)\big )\nonumber \\ \le&\, (1-\gamma )\frac{1}{K}\sum _{k=1}^K\mathbb {E}(J^{y^*}_{\lambda }(\pi ^*)-J^{\hat{y}_k}_{\lambda }(\pi _k))\nonumber \\ \le&\, \frac{1}{\sqrt{K}}(\mathbb {E}_{s \sim \nu _{\pi ^*}}[ \text {KL}(\pi ^*\Vert \pi _{1})]+9\varUpsilon ^2+r_{\max }^2)+ \frac{1}{K}\sum _{k=1}^K\nonumber \\&(2\sqrt{K}c_0\epsilon '_{k} + \eta _{\textrm{NPG}}^{-1}\epsilon ''_{k}+\frac{2c_3M(1-\gamma )^2\lambda }{\sqrt{k}}) \end{aligned}$$
(46)

Plugging \(\epsilon '_{k}\) and \(\epsilon ''_{k}\) defined in Lemma 12 into Eq. (46) and setting \(\epsilon _k\) as,

$$\begin{aligned} \epsilon _k&= \sqrt{8} c_0 \varUpsilon ^{1/2} \sigma _\xi ^{1/2} T^{-1/4} \\&+ \mathcal {O}\big ((\tau _{k+1} K^{1/2}+1) \varUpsilon ^{3/2} m^{-1/4}+ \varUpsilon ^{5/4} m^{-1/8}\big ) \\&+ c_0\mathcal {O}(\varUpsilon ^{3}m^{-1/2}\log (1/\delta )+\varUpsilon ^{5/2}m^{-1/4}\sqrt{\log (1/\delta )} \\&+\varUpsilon r_{\max }^2m^{-1/4}+\varUpsilon ^2K_{\textrm{TD}}^{-1/2}+\varUpsilon ) \end{aligned}$$

we complete the proof.
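As a quick numerical sanity check of the telescoping step above (a minimal sketch with synthetic, hypothetical KL values rather than quantities produced by TOPS), the telescoping sum of any non-negative sequence collapses to the first term minus the last, so the averaged KL contribution in Eq. (46) is at most \(\mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{1})]/\sqrt{K}\) once multiplied by \(\eta _{\textrm{NPG}}^{-1}=\sqrt{K}\) and averaged over the \(K\) iterations:

```python
import numpy as np

# Synthetic, hypothetical stand-ins for E_{s~nu_pi*}[KL(pi* || pi_k)], k = 1..K+1.
# Any non-negative sequence works for the telescoping identity.
rng = np.random.default_rng(0)
K = 1000
kl = np.abs(rng.normal(size=K + 1))          # kl[k-1] plays the role of KL(pi* || pi_k) >= 0

eta = 1.0 / np.sqrt(K)                        # eta_NPG = 1/sqrt(K), as in the proof

# Average of eta^{-1} * (KL_k - KL_{k+1}) over k = 1..K (telescoping sum).
telescoped = (1.0 / K) * (1.0 / eta) * np.sum(kl[:-1] - kl[1:])
bound = kl[0] / np.sqrt(K)                    # KL(pi* || pi_1) / sqrt(K)

# The telescoped average equals (KL_1 - KL_{K+1}) / sqrt(K) <= KL_1 / sqrt(K).
assert np.isclose(telescoped, (kl[0] - kl[-1]) / np.sqrt(K))
assert telescoped <= bound + 1e-12
print(telescoped, bound)
```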

1.5 D.5 Proof of Lemma 1

Proof

First, we have \(\mathbb {E}[G] = \frac{1}{1-\gamma }\mathbb {E}[R]\), i.e., the per-step reward \(R\), scaled by \(1/(1-\gamma )\), matches the cumulative reward \(G\) in expectation. Second, it has been shown that \(\mathbb {V}(G) \le \frac{\mathbb {V}(R)}{(1-\gamma )^2}\) [7]. Given \(\lambda \ge 0\), combining the above equality and inequality, we have

$$\begin{aligned} \frac{1}{(1-\gamma )}J_{\frac{\lambda }{(1-\gamma )}} (\pi )&= \frac{1}{(1-\gamma )}\Big (\mathbb {E}[R] - \frac{\lambda }{(1-\gamma )} \mathbb {V}(R)\Big )\\&\le \mathbb {E}[G] - \lambda \mathbb {V}(G) = J^G_\lambda (\pi ). \end{aligned}$$

This completes the proof.
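To make Lemma 1 concrete, the following is a minimal Monte-Carlo sketch of the two facts used above in the simplified special case of i.i.d. per-step rewards (so the stationary distribution plays no role); the reward distribution, discount factor, and horizon below are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, horizon, n_episodes = 0.95, 400, 20000    # hypothetical illustration settings

# i.i.d. per-step rewards R_t, a toy special case of the per-step reward R.
rewards = rng.uniform(0.0, 1.0, size=(n_episodes, horizon))
discounts = gamma ** np.arange(horizon)
returns = rewards @ discounts                    # G = sum_t gamma^t R_t (truncated)

mean_R, var_R = rewards.mean(), rewards.var()
mean_G, var_G = returns.mean(), returns.var()

# E[G] ~= E[R] / (1 - gamma)  (exact as the horizon grows, for i.i.d. rewards).
print(mean_G, mean_R / (1.0 - gamma))
# V(G) <= V(R) / (1 - gamma)^2, cf. [7]; here V(G) = V(R) / (1 - gamma^2) exactly.
print(var_G, var_R / (1.0 - gamma) ** 2)
assert var_G <= var_R / (1.0 - gamma) ** 2
```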

1.6 D.6 Proof of Lemma 2

We first provide the supporting lemmas for Lemma 2. We define the local linearization of \(f((s,a);\theta )\), defined in Eq. (4), at the initial point \(\varTheta _{\textrm{init}}\) as,

$$\begin{aligned} \hat{f}((s,a);\theta )=\frac{1}{\sqrt{m}}\sum _{v=1}^{m} b_v\mathbbm {1}\{[\varTheta _{\textrm{init}}]_v^\top (s,a)>0\} [\theta ]_v^\top (s,a) \end{aligned}$$
(47)

We then define the following function spaces,

$$\begin{aligned}&\mathcal {F}_{\varUpsilon ,m}:= \Bigg \{\frac{1}{\sqrt{m}}\sum _{v=1}^{m} b_v\mathbbm {1}\big \{[\varTheta _{\textrm{init}}]_v^\top (s,a)>0\big \} [\theta ]_v^\top (s,a):\\&\Vert \theta -\varTheta _{\textrm{init}}\Vert _2 \le \varUpsilon \Bigg \}, \end{aligned}$$

and

$$\begin{aligned}&\bar{\mathcal {F}}_{\varUpsilon ,m}:= \Bigg \{\frac{1}{\sqrt{m}}\sum _{v=1}^{m} b_v\mathbbm {1}\big \{[\varTheta _{\textrm{init}}]_v^\top (s,a)>0\big \} [\theta ]_v^\top (s,a):\\&\Vert [\theta ]_v-[\varTheta _{\textrm{init}}]_v\Vert _\infty \le \varUpsilon /\sqrt{md}\Bigg \}. \end{aligned}$$

Here \([\varTheta _{\textrm{init}}]_v\sim \mathcal {N}(0,I_d/d)\) and \(b_v\sim \text {Unif}(\{-1,1\})\) are the initial parameters. By definition, \(\bar{\mathcal {F}}_{\varUpsilon ,m}\) is a subset of \(\mathcal {F}_{\varUpsilon ,m}\). The following lemma characterizes the deviation of \(\bar{\mathcal {F}}_{\varUpsilon ,m}\) from \(\mathcal {F}_{\varUpsilon ,\infty }\).
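The following is a minimal sketch of the construction above, assuming the two-layer ReLU parameterization of Eq. (4) with \(b_v\) held fixed at their random signs and \([\varTheta _{\textrm{init}}]_v \sim \mathcal {N}(0, I_d/d)\); the dimensions and radius \(\varUpsilon \) below are hypothetical. It illustrates that the local linearization \(\hat{f}\) in Eq. (47) keeps the ReLU activation pattern frozen at \(\varTheta _{\textrm{init}}\) while \(f\) recomputes it at the current \(\theta \), and that the resulting gap shrinks as the width \(m\) grows (cf. Lemma 14 below):

```python
import numpy as np

def init_params(m, d, rng):
    """Random initialization: [Theta_init]_v ~ N(0, I_d/d), b_v ~ Unif({-1, +1})."""
    theta_init = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))
    b = rng.choice([-1.0, 1.0], size=m)
    return theta_init, b

def f(x, theta, b):
    """Two-layer ReLU network: (1/sqrt(m)) * sum_v b_v * relu(theta_v^T x)."""
    pre = theta @ x
    return (b * np.maximum(pre, 0.0)).sum() / np.sqrt(len(b))

def f_hat(x, theta, theta_init, b):
    """Local linearization (Eq. 47): activation pattern frozen at Theta_init."""
    active = (theta_init @ x > 0.0).astype(float)
    return (b * active * (theta @ x)).sum() / np.sqrt(len(b))

rng = np.random.default_rng(2)
d, upsilon = 8, 1.0                              # hypothetical dimension and radius
x = rng.normal(size=d); x /= np.linalg.norm(x)   # ||(s,a)||_2 <= 1
for m in (100, 1000, 10000):                     # gap shrinks as m grows
    theta_init, b = init_params(m, d, rng)
    delta = rng.normal(size=(m, d))
    delta *= upsilon / np.linalg.norm(delta)     # ||theta - Theta_init||_2 = Upsilon
    theta = theta_init + delta
    print(m, abs(f(x, theta, b) - f_hat(x, theta, theta_init, b)))
```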

Lemma 13

(Projection Error) [40]. Let \(f\in \mathcal {F}_{\varUpsilon ,\infty }\), where \(\mathcal {F}_{\varUpsilon ,\infty }\) is defined in Assumption 3. For any \(\delta >0\), it holds with probability at least \(1-\delta \) that

$$\begin{aligned} \Vert \varPi _{\bar{\mathcal {F}}_{\varUpsilon ,m}}f-f\Vert _\varsigma \le \varUpsilon m^{-1/2}[1+\sqrt{2\log (1/\delta )}] \end{aligned}$$

where \(\varsigma \) is any distribution over \(S \times A\).

Please refer to [40] for a detailed proof.

Lemma 14

(Linearization Error). Under Assumption 4, for all \(\theta \in \mathcal {D}\), where \(\mathcal {D} = \{\xi \in \mathbb {R}^{md}:\Vert \xi -\varTheta _{\text {init}} \Vert _2 \le \varUpsilon \}\), it holds that,

$$\begin{aligned} \mathbb {E}_{\nu _\pi }\Big [\Big (f\big ((s,a);\theta \big )-\hat{f}\big ((s,a);\theta \big )\Big )^2\Big ] \le \frac{4c_1 \varUpsilon ^3}{\sqrt{m}} \end{aligned}$$

where \(c_1 = c\sqrt{\mathbb {E}_{\mathcal {N}(0,I_d/d)}[1/\Vert (s,a)\Vert _2^2]}\), and c is defined in Assumption 4.

Proof

We start from the definitions in Eq. (4) and Eq. (47),

$$\begin{aligned}&\mathbb {E}_{\nu _\pi }\Big [\Big (f\big ((s,a);\theta \big )-\hat{f}\big ((s,a);\theta \big )\Big )^2\Big ]\nonumber \\&=\mathbb {E}_{\nu _\pi }\Big [\Big (\frac{1}{\sqrt{m}}\Big |\sum ^m_{v=1}\Big (\big (\mathbbm {1}\{[\theta ]_v^\top (s,a)>0\} - \mathbbm {1}\{[\varTheta _{\textrm{init}}]_v^\top (s,a)\nonumber \\&>0\}\big ) b_v [\theta ]_v^\top (s,a)\Big )\Big |\Big )^2\Big ]\nonumber \\ \le&\, \frac{1}{m}\mathbb {E}_{\nu _\pi }\Big [\Big (\sum ^m_{v=1}\Big (\Big |\mathbbm {1}\{[\theta ]_v^\top (s,a)>0\} - \mathbbm {1}\{[\varTheta _{\textrm{init}}]_v^\top (s,a)\nonumber \\&>0\}\Big | \Big |b_v\Big | \Big | [\theta ]_v^\top (s,a)\Big |\Big )\Big )^2\Big ] \end{aligned}$$
(48)

The above inequality holds because \(\vert \sum W \vert \le \sum \vert W \vert \), where \(W = \big ((\mathbbm {1}\{[\theta ]_v^\top (s,a)>0\} - \mathbbm {1}\{[\varTheta _{\textrm{init}}]_v^\top (s,a)>0\}) b_v [\theta ]_v^\top (s,a)\big )\), and \(\varTheta _{\textrm{init}}\) is defined in Eq. (5). Next, whenever \(\mathbbm {1}\{[\varTheta _{\textrm{init}}]_v^\top (s,a)>0\} \ne \mathbbm {1}\{[\theta ]_v^\top (s,a)>0\}\), we have,

$$\begin{aligned} |[\varTheta _{\textrm{init}}]_v^\top (s,a)|&\le |([\theta ]_v - [\varTheta _{\textrm{init}}]_v)^\top (s,a)| \nonumber \\&\le \Vert [\theta ]_v-[\varTheta _{\textrm{init}}]_v\Vert _2 , \end{aligned}$$
(49)

where we obtain the last inequality from the Cauchy-Schwarz inequality. We also assume that \(\Vert (s,a)\Vert _2 \le 1\) without loss of generality [31, 51]. Equation (49) further implies that,

$$\begin{aligned}&|\mathbbm {1}\{[\theta ]_v^\top (s,a)>0\} - \mathbbm {1}\{[\varTheta _{\textrm{init}}]_v^\top (s,a)>0\}| \nonumber \\ \le&\, \mathbbm {1}\{|[\varTheta _{\textrm{init}}]_v^\top (s,a)|\le \Vert [\theta ]_v-[\varTheta _{\textrm{init}}]_v\Vert _2\} \end{aligned}$$
(50)

Then, plugging Eq. (50) and the fact that \(|b_v|\le 1\) back into Eq. (48), we have the following,

$$\begin{aligned}&\mathbb {E}_{\nu _\pi }\Big [\Big (f\big ((s,a);\theta \big )-\hat{f}\big ((s,a);\theta \big )\Big )^2\Big ]\nonumber \\ \le&\, \frac{1}{m}\mathbb {E}_{\nu _\pi }\bigg [\bigg (\sum ^m_{v=1}\mathbbm {1}\Big \{\Big |[\varTheta _{\textrm{init}}]_v^\top (s,a)\Big |\le \Big \Vert [\theta ]_v-[\varTheta _{\textrm{init}}]_v\Big \Vert _2\Big \}\nonumber \\&\Big | [\theta ]_v^\top (s,a)\Big |\bigg )^2\bigg ] \nonumber \\ \le&\, \frac{1}{m}\mathbb {E}_{\nu _\pi }\bigg [\bigg (\sum ^m_{v=1}\mathbbm {1}\Big \{\Big |[\varTheta _{\textrm{init}}]_v^\top (s,a)\Big |\le \Big \Vert [\theta ]_v-[\varTheta _{\textrm{init}}]_v\Big \Vert _2\Big \}\nonumber \\&\Big (\Big |\big ([\theta ]_v - [\varTheta _{\textrm{init}}]_v\big )^\top (s,a)\Big | + \Big | [\varTheta _{\textrm{init}}]_v^\top (s,a)\Big |\Big )\bigg )^2\bigg ] \nonumber \\ \le&\, \frac{1}{m}\mathbb {E}_{\nu _\pi }\bigg [\bigg (\sum ^m_{v=1}\mathbbm {1}\Big \{\Big |[\varTheta _{\textrm{init}}]_v^\top (s,a)\Big |\le \Big \Vert [\theta ]_v-[\varTheta _{\textrm{init}}]_v\Big \Vert _2\Big \}\nonumber \\&\Big (\Big \Vert [\theta ]_v - [\varTheta _{\textrm{init}}]_v\Big \Vert _2 + \Big | [\varTheta _{\textrm{init}}]_v^\top (s,a)\Big |\Big )\bigg )^2\bigg ] \nonumber \\ \le&\, \frac{1}{m}\mathbb {E}_{\nu _\pi }\bigg [\bigg (\sum ^m_{v=1}\mathbbm {1}\Big \{\Big |[\varTheta _{\textrm{init}}]_v^\top (s,a)\Big |\le \Big \Vert [\theta ]_v-[\varTheta _{\textrm{init}}]_v\Big \Vert _2\Big \}\nonumber \\&2\Big \Vert [\theta ]_v - [\varTheta _{\textrm{init}}]_v\Big \Vert _2\bigg )^2\bigg ] \end{aligned}$$
(51)

We obtain the second inequality by the fact that \(|A|\le |A-B|+|B|\). Then, following the Cauchy-Schwarz inequality and \(\Vert (s,a)\Vert _2 \le 1\), we obtain the third inequality. Inserting Eq. (49) yields the fourth inequality. We continue from Eq. (51) by applying the Cauchy-Schwarz inequality and plugging in \(\big \Vert \theta - \varTheta _{\textrm{init}}\big \Vert _2 \le \varUpsilon \),

$$\begin{aligned}&\mathbb {E}_{\nu _\pi }\Big [\Big (f\big ((s,a);\theta \big )-\hat{f}\big ((s,a);\theta \big )\Big )^2\Big ]\nonumber \\ \le&\, \frac{4\varUpsilon ^2}{m}\mathbb {E}_{\nu _\pi }\Big [\sum ^m_{v=1}\mathbbm {1}\{|[\varTheta _{\textrm{init}}]_v^\top (s,a)|\le \Vert [\theta ]_v-[\varTheta _{\textrm{init}}]_v\Vert _2\}\Big ]\nonumber \\&= \frac{4\varUpsilon ^2}{m}\sum ^m_{v=1}P_{\nu _\pi }\big (|[\varTheta _{\textrm{init}}]_v^\top (s,a)|\le \Vert [\theta ]_v-[\varTheta _{\textrm{init}}]_v\Vert _2\big )\nonumber \\ \le&\, \frac{4c\varUpsilon ^2}{m}\sum ^m_{v=1}\frac{\Vert [\theta ]_v-[\varTheta _{\textrm{init}}]_v\Vert _2}{\Vert [\varTheta _{\textrm{init}}]_v\Vert _2}\nonumber \\ \le&\, \frac{4c\varUpsilon ^2}{m}\Big (\sum ^m_{v=1}\Vert [\theta ]_v-[\varTheta _{\textrm{init}}]_v\Vert _2^2\Big )^{1/2}\Big (\sum ^m_{v=1}\frac{1}{\Vert [\varTheta _{\textrm{init}}]_v\Vert _2^2}\Big )^{1/2}\nonumber \\ \le&\, \frac{4c_1 \varUpsilon ^3}{\sqrt{m}} \end{aligned}$$
(52)

We obtain the second inequality by imposing Assumption 4 and the third by the Cauchy-Schwarz inequality. Finally, we set \(c_1 := c\sqrt{\mathbb {E}_{\mathcal {N}(0,I_d/d)}[1/\Vert (s,a)\Vert _2^2]} \), which completes the proof.

In the \(t\)-th TD iteration, we denote the temporal-difference terms w.r.t. \(\hat{f}((s,a);\theta _t)\) and \(f((s,a);\theta _t)\) as

$$\begin{aligned} \delta _t^0((s,a),(s,a)';\theta _t)&= \hat{f}((s,a)';\theta _t)-\gamma \hat{f}((s,a);\theta _t) \\&- r_{s,a},\\ \delta _t^\theta ((s,a),(s,a)';\theta _t)&= f((s,a)';\theta _t)-\gamma f((s,a);\theta _t) \\&- r_{s,a}. \end{aligned}$$

For notational simplicity, in the sequel we write \(\delta _t^0((s,a),(s,a)';\theta _t)\) and \(\delta _t^\theta ((s,a),(s,a)';\theta _t)\) as \(\delta _t^0\) and \(\delta _t^\theta \). We further define the stochastic semi-gradient \(g_t(\theta _t):=\delta _t^\theta \nabla _{\theta } f((s,a);\theta _t)\) and its population mean \(\bar{g}_t(\theta _t):=\mathbb {E}_{\nu _\pi }[g_t(\theta _t)]\). The local linearization of \(\bar{g}_t(\theta _t)\) is \(\hat{g}_t(\theta _t):=\mathbb {E}_{\nu _\pi }[\delta _t^0 \nabla _{\theta } \hat{f}((s,a);\theta _t)]\). We denote these by \(g_t\), \(\bar{g}_t\), and \(\hat{g}_t\), respectively, for simplicity.
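Before stating the next lemmas, the following is a minimal sketch of the projected stochastic semi-gradient TD update tracked by the analysis below (cf. the first line of Eq. (60)); the linear form of \(f\), the feature sampler, and the step size are hypothetical placeholders rather than the TOPS implementation, and the TD term mirrors the definition of \(\delta _t^\theta \) given above:

```python
import numpy as np

def project_ball(theta, theta_init, upsilon):
    """Pi_D: projection onto D = {theta : ||theta - Theta_init||_2 <= Upsilon}."""
    diff = theta - theta_init
    norm = np.linalg.norm(diff)
    return theta if norm <= upsilon else theta_init + diff * (upsilon / norm)

def td_semi_gradient_step(theta, theta_init, x, x_prime, r, f, grad_f,
                          eta, gamma, upsilon):
    """One projected step: theta <- Pi_D(theta - eta * g_t(theta))."""
    delta = f(x_prime, theta) - gamma * f(x, theta) - r   # delta_t^theta as above
    g = delta * grad_f(x, theta)                          # stochastic semi-gradient
    return project_ball(theta - eta * g, theta_init, upsilon)

# Hypothetical linear special case f((s,a); theta) = theta^T (s,a).
rng = np.random.default_rng(0)
d, gamma, eta, upsilon = 16, 0.9, 0.05, 1.0
theta_init = rng.normal(scale=1.0 / np.sqrt(d), size=d)
theta = theta_init.copy()
f = lambda x, th: th @ x
grad_f = lambda x, th: x
for _ in range(200):
    x, x_prime = rng.normal(size=d), rng.normal(size=d)   # placeholder transitions
    r = rng.normal()
    theta = td_semi_gradient_step(theta, theta_init, x, x_prime, r,
                                  f, grad_f, eta, gamma, upsilon)
print(np.linalg.norm(theta - theta_init))                 # stays <= Upsilon
```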

Lemma 15

Under Assumption 4, for all \(\theta _t \in \mathcal {D}\), where \(\mathcal {D} = \{\xi \in \mathbb {R}^{md}:\Vert \xi -\varTheta _{\text {init}} \Vert _2 \le \varUpsilon \}\), it holds with probability \(1-\delta \) that,

$$\begin{aligned}&\Vert \bar{g}_t-\hat{g}_t\Vert _2 \\&= \mathcal {O}\Big (\varUpsilon ^{3/2}m^{-1/4}\big (1+(m\log \frac{1}{\delta })^{-1/2}\big )+\varUpsilon ^{1/2}r_{\max } m^{-1/4}\Big ) \end{aligned}$$

Proof

By the definition of \(\bar{g}_t\) and \(\hat{g}_t\), we have

$$\begin{aligned}&\big \Vert \bar{g}_t-\hat{g}_t\big \Vert _2^2\nonumber \\&=\big \Vert \mathbb {E}_{\nu _\pi }[\delta _t^\theta \nabla _{\theta } f((s,a);\theta _t)-\delta _t^0 \nabla _{\theta } \hat{f}((s,a);\theta _t)]\big \Vert _2^2\nonumber \\&=\big \Vert \mathbb {E}_{\nu _\pi }[(\delta _t^\theta -\delta _t^0) \nabla _{\theta } f((s,a);\theta _t)+\delta _t^0 (\nabla _{\theta } f((s,a);\theta _t)-\nonumber \\&\nabla _{\theta } \hat{f}((s,a);\theta _t))]\big \Vert _2^2\nonumber \\ \le&\, 2\mathbb {E}_{\nu _\pi }\big [(\delta _t^\theta -\delta _t^0)^2 \Vert \nabla _{\theta } f((s,a);\theta _t)\Vert _2^2\big ] + \nonumber \\&2\mathbb {E}_{\nu _\pi }\big [\big (|\delta _t^0| \Vert \nabla _{\theta } f((s,a);\theta _t)-\nabla _{\theta } \hat{f}((s,a);\theta _t))\Vert _2\big )^2\big ] \end{aligned}$$
(53)

We obtain the inequality because \((A+B)^2 \le 2A^2+2B^2\). We first upper bound \(\mathbb {E}_{\nu _\pi }\big [(\delta _t^\theta -\delta _t^0)^2 \Vert \nabla _{\theta } f((s,a);\theta _t)\Vert _2^2\big ]\) in Eq. (53). Since \(\Vert (s,a)\Vert _2 \le 1\), we have \(\Vert \nabla _{\theta } f((s,a);\theta _t)\Vert _2 \le 1\). Then by definition, we have the following first inequality,

$$\begin{aligned}&\mathbb {E}_{\nu _\pi }\Big [\Big (\delta _t^\theta -\delta _t^0\Big )^2 \Big \Vert \nabla _{\theta } f((s,a);\theta _t)\Big \Vert _2^2\Big ] \nonumber \\ \le&\, \mathbb {E}_{\nu _\pi }\Big [\Big (f\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _t\big )-\gamma \Big (f\big ((s',a');\theta _t\big )\nonumber \\&-\hat{f}\big ((s',a');\theta _t)\big )\Big )\Big )^2\Big ] \nonumber \\ \le&\, \mathbb {E}_{\nu _\pi }\Big [\Big (\Big |f\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _t\big )\Big |+\Big |f\big ((s',a');\theta _t\big )\nonumber \\&-\hat{f}\big ((s',a');\theta _t\big )\Big |\Big )^2\Big ] \nonumber \\ \le&\, 2\mathbb {E}_{\nu _\pi }\Big [\Big (f\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _t\big )\Big )^2\Big ]+2\mathbb {E}_{\nu _\pi }\nonumber \\&\Big [\Big (f\big ((s',a');\theta _t\big )-\hat{f}\big ((s',a');\theta _t\big )\Big )^2\Big ] \nonumber \\ \le&\, 4\mathbb {E}_{\nu _\pi }\Big [\Big (f\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _t\big )\Big )^2\Big ]\le \frac{16c_1 \varUpsilon ^3}{\sqrt{m}} \end{aligned}$$
(54)

We obtain the second inequality from \(|\gamma | \le 1\) and the third from the fact that \((A+B)^2 \le 2A^2+2B^2\). We reach the final step by inserting Lemma 14. We then proceed to upper bound \(\mathbb {E}_{\nu _\pi }\big [|\delta _t^0| \Vert \nabla _{\theta } f((s,a);\theta _t)-\nabla _{\theta } \hat{f}((s,a);\theta _t)\Vert _2\big ]\). By Hölder’s inequality, we have,

$$\begin{aligned}&\mathbb {E}_{\nu _\pi }\big [\big (|\delta _t^0| \Vert \nabla _{\theta } f((s,a);\theta _t)-\nabla _{\theta } \hat{f}((s,a);\theta _t))\Vert _2\big )^2\big ] \nonumber \\ \le&\,\mathbb {E}_{\nu _\pi }\big [(\delta _t^0)^2\big ] \mathbb {E}_{\nu _\pi }\big [\Vert \nabla _{\theta } f((s,a);\theta _t)-\nabla _{\theta } \hat{f}((s,a);\theta _t))\Vert _2^2\big ]\nonumber \\ \end{aligned}$$
(55)

We first derive an upper bound for the first term in Eq. (55), starting from its definition,

$$\begin{aligned}&\mathbb {E}_{\nu _\pi }\big [(\delta _t^0)^2\big ]\nonumber \\&=\mathbb {E}_{\nu _\pi }\Big [\big [\hat{f}\big ((s',a');\theta _t\big )-\gamma \hat{f}\big ((s,a);\theta _t\big ) - r_{s,a}\big ]^2\Big ]\nonumber \\ \le&\,3\mathbb {E}_{\nu _\pi }\Big [\big (\hat{f}\big ((s',a');\theta _t\big )\big )^2\Big ]+3\mathbb {E}_{\nu _\pi }\Big [\big (\gamma \hat{f}\big ((s,a);\theta _t\big )\big )^2\Big ]\nonumber \\&+3\mathbb {E}_{\nu _\pi }\Big [ r^2_{s,a}\Big ]\nonumber \\ \le&\, 6\mathbb {E}_{\nu _\pi }\Big [\big (\hat{f}\big ((s,a);\theta _t\big )\big )^2\Big ]+3r_{\max }^2 \nonumber \\&= 6\mathbb {E}_{\nu _\pi }\Big [\big (\hat{f}\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )+\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\nonumber \\&-Q_\pi +Q_\pi \big )^2\Big ]+3r_{\max }^2 \nonumber \\ \le&\, 18\mathbb {E}_{\nu _\pi }\Big [\big (\hat{f}\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\big )^2\Big ] + 18\mathbb {E}_{\nu _\pi }\nonumber \\&\Big [\big (\hat{f}\big ((s,a);\theta _{\pi ^*}\big )-Q_\pi \big )^2\Big ] + 18\mathbb {E}_{\nu _\pi }\Big [\big (Q_\pi \big )^2\Big ]+ 3r_{\max }^2 \nonumber \\ \le&\, 72\varUpsilon ^2 + 18\mathbb {E}_{\nu _\pi }\Big [\big (\hat{f}\big ((s,a);\theta _{\pi ^*}\big )-Q_\pi \big )^2\Big ]\nonumber \\&+ 21(1-\gamma )^{-2}r_{\max }^2 \end{aligned}$$
(56)

We obtain the first and the third inequality by the fact that \((A+B+C)^2 \le 3A^2+3B^2+3C^2\). Recall that \(r_{\max }\) is the bound on the reward function \(r\), which leads to the second inequality. We obtain the last inequality in Eq. (56) from the facts that \(|\hat{f}((s,a);\theta _t)-\hat{f}((s,a);\theta _{\pi ^*})| \le \Vert \theta _t-\theta _{\pi ^*}\Vert _2 \le 2\varUpsilon \) and \(Q_\pi \le (1-\gamma )^{-1}r_{\max }\). Since \(\bar{\mathcal {F}}_{\varUpsilon ,m} \subset \mathcal {F}_{\varUpsilon ,m}\), by Lemma 13, we have,

$$\begin{aligned} \mathbb {E}_{\nu _\pi }\Big [\Big (\hat{f}\big ((s,a);\theta _{\pi ^*}\big )-Q_\pi \Big )^2\Big ] \le \frac{\varUpsilon ^2\big (1+\sqrt{2\log (1/\delta )}\big )^2}{m} \end{aligned}$$
(57)

Combining Eqs. (56) and (57), we have, with probability \(1-\delta \),

$$\begin{aligned}&\mathbb {E}_{\nu _\pi }\big [(\delta _t^0)^2\big ]\nonumber \\ \le&\, 72\varUpsilon ^2(1+\frac{\log (1/\delta )}{m})+ 21(1-\gamma )^{-2}r_{\max }^2 \end{aligned}$$
(58)

Lastly, we have

$$\begin{aligned}&\mathbb {E}_{\nu _\pi }\big [\Vert \nabla _{\theta } f((s,a);\theta _t)-\nabla _{\theta } \hat{f}((s,a);\theta _t))\Vert _2^2\big ]\nonumber \\&= \mathbb {E}_{\nu _\pi }\Big [\Big (\frac{1}{m}\sum ^m_{v=1}\big (\mathbbm {1}\{[\theta ]_v^\top (s,a)>0\} - \mathbbm {1}\{[\varTheta _{\textrm{init}}]_v^\top (s,a)\nonumber \\&>0\})^2 (b_v)^2 \Vert (s,a)\Vert _2^2\Big )\Big ]\nonumber \\ \le&\, \mathbb {E}_{\nu _\pi }\Big [\frac{1}{m}\sum ^m_{v=1}\big (\mathbbm {1}\{|[\varTheta _{\textrm{init}}]_v^\top (s,a)|\le \Vert [\theta ]_v-[\varTheta _{\textrm{init}}]_v\Vert _2\}\big )\Big ]\nonumber \\ \le&\, \frac{c_1 \varUpsilon }{\sqrt{m}} \end{aligned}$$
(59)

We obtain the first inequality by following Eq. (50) and the facts that \(|b_v| \le 1\) and \(\Vert (s,a)\Vert _2 \le 1\). For the rest, we follow a similar argument to that of Eq. (52). To finish the proof, we plug Eqs. (54), (58), and (59) back into Eq. (53),

$$\begin{aligned}&\Vert \bar{g}_t-\hat{g}_t\Vert _2^2\\ \le&\, 2\Big (\frac{16c_1 \varUpsilon ^3}{\sqrt{m}} + \Big (72\varUpsilon ^2(1+\frac{\log (1/\delta )}{m})+ 21(1-\gamma )^{-2}r_{\max }^2\Big )\\&\frac{c_1 \varUpsilon }{\sqrt{m}}\Big ) \\&=\frac{176 c_1 \varUpsilon ^3}{\sqrt{m}} + \frac{144 c_1 \varUpsilon ^3\log (1/\delta )}{m^{3/2}}+ \frac{42 c_1 \varUpsilon r_{\max }^2}{(1-\gamma )^{2}\sqrt{m}} \end{aligned}$$

Then we have,

$$\begin{aligned}&\Vert \bar{g}_t-\hat{g}_t\Vert _2\\ \le&\, \sqrt{\frac{176 c_1 \varUpsilon ^3}{\sqrt{m}} + \frac{144 c_1 \varUpsilon ^3\log (1/\delta )}{m^{3/2}}+ \frac{42 c_1 \varUpsilon r_{\max }^2}{(1-\gamma )^{2}\sqrt{m}}} \\ \le&\, \sqrt{\frac{176 c_1 \varUpsilon ^3}{\sqrt{m}}} + \sqrt{\frac{144 c_1 \varUpsilon ^3\log (1/\delta )}{m^{3/2}}}+ \sqrt{\frac{42 c_1 \varUpsilon r_{\max }^2}{(1-\gamma )^{2}\sqrt{m}}} \\&=\mathcal {O}\Big (\varUpsilon ^{3/2}m^{-1/4}\big (1+(m\log \frac{1}{\delta })^{-1/2}\big )+\varUpsilon ^{1/2}r_{\max } m^{-1/4}\Big ) \end{aligned}$$

Next, we provide the following lemma to characterize the variance of \(g_t\).

Lemma 16

(Variance of the Stochastic Update Vector) [31]. There exists a constant \(\xi _g^2=\mathcal {O}(\varUpsilon ^2)\), independent of \(t\), such that for any \(t \le T\), it holds that

$$\begin{aligned} \mathbb {E}_{\nu _\pi }[\Vert g_t(\theta _t)-\bar{g}_t(\theta _t)\Vert _2^2] \le \xi _g^2 \end{aligned}$$

A detailed proof can be found in [31]. Now we provide the proof for Lemma 2.

Proof

$$\begin{aligned}&\big \Vert \theta _{t+1}-\theta _{\pi ^*}\big \Vert _2^2\nonumber \\ =&\, \big \Vert \varPi _\mathcal {D}(\theta _t-\eta g_t(\theta _t))-\varPi _\mathcal {D}(\theta _{\pi ^*}-\eta \hat{g}_t(\theta _{\pi ^*}))\big \Vert _2^2\nonumber \\ \le&\,\big \Vert (\theta _t-\theta _{\pi ^*}) -\eta \big (g_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big )\big \Vert _2^2\nonumber \\ =&\, \big \Vert \theta _t-\theta _{\pi ^*}\big \Vert _2^2 - 2\eta \big ( g_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big )^\top \big (\theta _t-\theta _{\pi ^*}\big )\nonumber \\&+ \eta ^2 \big \Vert g_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big \Vert _2^2 \end{aligned}$$
(60)

The inequality holds because the Euclidean projection \(\varPi _\mathcal {D}\) onto the convex set \(\mathcal {D}\) is non-expansive, i.e., \(\Vert \varPi _\mathcal {D}(x)-\varPi _\mathcal {D}(y)\Vert _2 \le \Vert x-y\Vert _2\). We first upper bound \(\big \Vert g_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big \Vert _2^2\) in Eq. (60),

$$\begin{aligned}&\big \Vert g_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big \Vert _2^2\nonumber \\ =&\,\big \Vert g_t(\theta _t)-\bar{g}_t(\theta _t) +\bar{g}_t(\theta _t)-\hat{g}_t(\theta _t)+\hat{g}_t(\theta _t)- \hat{g}_t(\theta _{\pi ^*})\big \Vert _2^2\nonumber \\ \le&\,3\Big (\big \Vert g_t(\theta _t)-\bar{g}_t(\theta _t)\big \Vert _2^2 +\big \Vert \bar{g}_t(\theta _t)-\hat{g}_t(\theta _t)\big \Vert _2^2+\nonumber \\ {}&\big \Vert \hat{g}_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big \Vert _2^2\Big ) \end{aligned}$$
(61)

The inequality holds due to the fact that \((A+B+C)^2 \le 3A^2+3B^2+3C^2\). Two of the terms on the right-hand side of Eq. (61) are characterized in Lemmas 15 and 16. We therefore characterize the remaining term,

$$\begin{aligned}&\big \Vert \hat{g}_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big \Vert _2^2\nonumber \\ =&\,\mathbb {E}_{\nu _\pi }\Big [\big (\delta _t^0(\theta _t)-\delta _t^0(\theta _{\pi ^*})\big )^2\big \Vert \nabla _{\theta }\hat{f}\big ((s,a);\theta _t\big )\big \Vert _2^2\Big ]\nonumber \\ \le&\,\mathbb {E}_{\nu _\pi }\bigg [\bigg (\Big (\hat{f}\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )-\gamma \Big (\hat{f}\big ((s',a');\nonumber \\&\theta _t\big )-\hat{f}\big ((s',a');\theta _{\pi ^*}\big )\Big )\bigg )^2\bigg ]\nonumber \\ \le&\,\mathbb {E}_{\nu _\pi }\Big [\Big (\hat{f}\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )^2\Big ]+2\gamma \mathbb {E}_{\nu _\pi }\nonumber \\&\Big [\Big (\hat{f}\big ((s',a');\theta _t\big )-\hat{f}\big ((s',a');\theta _{\pi ^*}\big )\Big )\Big (\hat{f}\big ((s,a);\theta _t\big )\nonumber \\&-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )\Big ]\nonumber \\&+\gamma ^2\mathbb {E}_{\nu _\pi }\Big [\Big (\hat{f}\big ((s',a');\theta _t\big )-\hat{f}\big ((s',a');\theta _{\pi ^*}\big )\Big )^2\Big ] \end{aligned}$$
(62)

We obtain the first inequality by the fact that \(\Vert \nabla _{\theta }\hat{f}((s,a);\theta _t)\Vert _2 \le 1\). Then we use the fact that \((s,a)\) and \((s',a')\) have the same marginal distribution, as well as \(\gamma < 1\), for the second inequality. Following the Cauchy-Schwarz inequality and the fact that \((s,a)\) and \((s',a')\) have the same marginal distribution, we have

$$\begin{aligned}&\mathbb {E}_{\nu _\pi }\Big [\Big (\hat{f}\big ((s',a');\theta _t\big )-\hat{f}\big ((s',a');\theta _{\pi ^*}\big )\Big )\Big (\hat{f}\big ((s,a);\theta _t\big )\nonumber \\&-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )\Big ]\nonumber \\ \le&\,\mathbb {E}_{\nu _\pi }\Big [\Big (\hat{f}\big ((s',a');\theta _t\big )-\hat{f}\big ((s',a');\theta _{\pi ^*}\big )\Big )\Big ]\mathbb {E}_{\nu _\pi }\nonumber \\&\Big [\Big (\hat{f}\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )\Big ]\nonumber \\ =&\,\mathbb {E}_{\nu _\pi }\Big [\Big (\hat{f}\big ((s',a');\theta _t\big )-\hat{f}\big ((s',a');\theta _{\pi ^*}\big )\Big )^2\Big ] \end{aligned}$$
(63)

We plug Eq. (63) back into Eq. (62),

$$\begin{aligned}&\big \Vert \hat{g}_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big \Vert _2^2 \nonumber \\ \le&\, (1+\gamma )^2\mathbb {E}_{\nu _\pi }\Big [\Big (\hat{f}\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )^2\Big ]. \end{aligned}$$
(64)

Next, we bound \(\big ( g_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big )^\top \big (\theta _t-\theta _{\pi ^*}\big )\) from below. We have,

$$\begin{aligned}&\big ( g_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big )^\top \big (\theta _t-\theta _{\pi ^*}\big ) \nonumber \\ =&\,\big ( g_t(\theta _t)-\bar{g}_t(\theta _t))\big )^\top \big (\theta _t-\theta _{\pi ^*}\big ) + \big ( \bar{g}_t(\theta _t)-\hat{g}_t(\theta _t)\big )^\top \nonumber \\&\big (\theta _t-\theta _{\pi ^*}\big ) + \big ( \hat{g}_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big )^\top \big (\theta _t-\theta _{\pi ^*}\big ) \end{aligned}$$
(65)

One term on the right-hand side of Eq. (65) is characterized by Lemma 16. We continue to characterize the remaining terms. First, by Hölder’s inequality, we have

$$\begin{aligned}&\big ( \bar{g}_t(\theta _t)-\hat{g}_t(\theta _t)\big )^\top \big ( \theta _t-\theta _{\pi ^*}\big ) \nonumber \\ \ge&\, -\big \Vert \bar{g}_t(\theta _t)-\hat{g}_t(\theta _t)\big \Vert _2\big \Vert \theta _t-\theta _{\pi ^*}\big \Vert _2 \nonumber \\ \ge&\, -2\varUpsilon \Vert \bar{g}_t(\theta _t)-\hat{g}_t(\theta _t)\big \Vert _2 \end{aligned}$$
(66)

We obtain the second inequality since \(\big \Vert \theta _t-\theta _{\pi ^*}\big \Vert _2 \le 2\varUpsilon \) by definition. For the last term,

$$\begin{aligned}&\big ( \hat{g}_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big )^\top \big (\theta _t-\theta _{\pi ^*}\big ) \nonumber \\ =&\,\mathbb {E}_{\nu _\pi }\bigg [\bigg (\Big (\hat{f}\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )-\gamma \Big (\hat{f}\big ((s',a');\theta _t\big )\nonumber \\&-\hat{f}\big ((s',a');\theta _{\pi ^*}\big )\Big )\bigg )\Big (\nabla _{\theta }\hat{f}\big ((s,a);\theta _t\big )\Big )^\top \Big (\theta _t-\theta _{\pi ^*}\Big )\bigg ] \nonumber \\ =&\,\mathbb {E}_{\nu _\pi }\bigg [\bigg (\Big (\hat{f}\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )-\gamma \Big (\hat{f}\big ((s',a');\theta _t\big )\nonumber \\&-\hat{f}\big ((s',a');\theta _{\pi ^*}\big )\Big )\bigg )\Big (\hat{f}\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )\bigg ]\nonumber \\ \ge&\, \mathbb {E}_{\nu _\pi }\bigg [\bigg (\Big (\hat{f}\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )\bigg )^2\bigg ]\nonumber \\&-\gamma \mathbb {E}_{\nu _\pi }\bigg [\bigg (\Big (\hat{f}\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )\bigg )^2\bigg ]\nonumber \\ =&\, (1-\gamma )\mathbb {E}_{\nu _\pi }\Big [\Big (\hat{f}\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )^2\Big ], \end{aligned}$$
(67)

where the inequality follows from Eq. (63). Combining Eqs. (60), (61), (64), (65), (66) and (67), we have,

$$\begin{aligned}&\big \Vert \theta _{t+1}-\theta _{\pi ^*}\big \Vert _2^2\nonumber \\ \le&\, \big \Vert \theta _t-\theta _{\pi ^*}\big \Vert _2^2 -\big (2\eta (1-\gamma )-3\eta ^2(1+\gamma )^2\big )\nonumber \\&\mathbb {E}_{\nu _\pi }\Big [\Big (\hat{f}\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )^2\Big ]\nonumber \\&+ 3\eta ^2\Vert \bar{g}_t-\hat{g}_t\Vert _2^2 +4\eta \varUpsilon \Vert \bar{g}_t-\hat{g}_t\Vert _2 + 4\varUpsilon \eta |\xi _g| \nonumber \\&+ 3\eta ^2\xi _g^2 \end{aligned}$$
(68)

We then bound the error terms by rearranging Eq. (68). First, we have, with probability \(1-\delta \),

$$\begin{aligned}&\mathbb {E}_{\nu _\pi }\Big [\Big (f\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )^2\Big ] \nonumber \\ =&\,\mathbb {E}_{\nu _\pi }\Big [\Big (f\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _t\big )+\hat{f}\big ((s,a);\theta _t\big )\nonumber \\&-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )^2\Big ]\nonumber \\ \le&\,2\mathbb {E}_{\nu _\pi }\Big [\Big (f\big ((s,a);\theta _t\big )-\hat{f}\big ((s,a);\theta _t\big )\Big )^2+\Big (\hat{f}\big ((s,a);\theta _t\big )\nonumber \\&-\hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )^2\Big ]\nonumber \\ \le&\,\big (\eta (1-\gamma )-1.5\eta ^2(1+\gamma )^2\big )^{-1}\Big (\big \Vert \theta _t-\theta _{\pi ^*}\big \Vert _2^2 \nonumber \\&- \Vert \theta _{t+1}-\theta _{\pi ^*}\big \Vert _2^2 + 4\varUpsilon \eta |\xi _g| + 3\eta ^2\xi _g^2\Big ) + \epsilon _g \end{aligned}$$
(69)

where

$$\begin{aligned} \epsilon _g&= \mathcal {O}(\varUpsilon ^{3}m^{-1/2}\log (1/\delta )+\varUpsilon ^{5/2}m^{-1/4}\sqrt{\log (1/\delta )}\nonumber \\&+\varUpsilon r_{\max }^2m^{-1/4}) \end{aligned}$$

We obtain the first inequality by the fact that \((A+B)^2\le 2A^2 + 2B^2\). Then, by Eq. (68), Lemma 14, and Lemma 15, we reach the final inequality. By telescoping Eq. (69) for \(t = 1\) to \(T\), we have, with probability \(1-\delta \),

$$\begin{aligned}&\big \Vert f\big ((s,a);\theta _T\big ) - \hat{f}\big ((s,a);\theta _{\pi ^*}\big ) \big \Vert ^2 \\ \le&\, \frac{1}{T}\sum _{t=1}^{T}\mathbb {E}_{\nu _\pi }\Big [\Big (f\big ((s,a);\theta _t\big ) - \hat{f}\big ((s,a);\theta _{\pi ^*}\big )\Big )^2\Big ] \\ \le&\,T^{-1}\big (2\eta (1-\gamma )-3\eta ^2(1+\gamma )^2\big )^{-1}(\Vert \varTheta _{\textrm{init}}-\theta _{\pi ^*} \Vert _2^2 + \\&4\varUpsilon T\eta |\xi _g|+3T\eta ^2\xi _g^2) + \epsilon _g \end{aligned}$$

Setting \(\eta =\min \{1/\sqrt{T}, (1-\gamma )/(3(1+\gamma )^2)\}\), which implies that \(T^{-1/2}(2\eta (1-\gamma )-3\eta ^2(1+\gamma )^2)^{-1} \le 1/(1-\gamma )^2\), we have, with probability \(1-\delta \),

$$\begin{aligned}&\big \Vert f\big ((s,a);\theta _T\big ) - \hat{f}\big ((s,a);\theta _{\pi ^*}\big ) \big \Vert \\ \le&\,\frac{1}{(1-\gamma )^2\sqrt{T}}\big (\Vert \varTheta _{\textrm{init}}-\theta _{\pi ^*} \Vert _2^2 + 4\varUpsilon \sqrt{T}|\xi _g| \\&+3\xi _g^2\big ) + \epsilon _g \\ \le&\,\frac{\varUpsilon ^2 + 4\varUpsilon \sqrt{T}|\xi _g|+3\xi _g^2}{(1-\gamma )^2\sqrt{T}} + \epsilon _g \\ =&\, \mathcal {O}(\varUpsilon ^{3}m^{-1/2}\log (1/\delta )+\varUpsilon ^{5/2}m^{-1/4}\sqrt{\log (1/\delta )} \\&+\varUpsilon r_{\max }^2m^{-1/4}+\varUpsilon ^2T^{-1/2}+\varUpsilon ) \end{aligned}$$

We obtain the second inequality by the fact that \(\Vert \varTheta _{\textrm{init}}-\theta _{\pi ^*} \Vert _2\le \varUpsilon \). Then, replacing the networks by \(\tilde{Q}_{\omega _k}\) and \(\tilde{Q}_{\pi _k}\) according to their definitions completes the proof.

E Additional Related Work

1.1 E.1 Global Optimality of Policy Search Methods

A major challenge in existing RL research is the lack of theoretical justification, such as sample-complexity analysis, mainly because the objective function of policy search in RL is often nonconvex. It is therefore challenging to determine whether a policy-search approach is guaranteed to reach the global optimum. Moreover, the components of RL architectures are usually parameterized by neural networks in practice, whose nonlinearity and complexity make the analysis significantly more difficult [62].

The theoretical understanding of policy gradient methods is also still at an early stage. Work on this topic has mostly been done in tabular and linear-parameterization settings for different variants of policy gradient. For example, [11] and [44] establish non-asymptotic convergence guarantees for natural policy gradient (NPG, [22]) and trust region policy optimization (TRPO, [42]), respectively. [35] show the convergence rate for the softmax parameterization, while [1] analyze multiple variants of policy gradient. On the other side of the spectrum, [31, 51] prove the global convergence and optimality of various policy gradient algorithms with over-parameterized neural networks. Furthermore, [62] apply the global optimality analysis to variance-constrained actor-critic risk-averse control with cumulative average rewards and propose a corresponding variance-constrained actor-critic (VARAC) algorithm. However, the analysis is complicated by the risk constraints on cumulative rewards, and the algorithm's experimental performance remains unverified. It therefore remains an open question whether a simplified global-optimality analysis, accompanied by verifiable experimental studies, can be obtained for risk-averse policy search methods.

1.2 E.2 Over-Parameterized Neural Networks in RL

Over-parameterization, the technique of deploying more parameters than necessary, improves the performance of neural networks [59]. The learning ability and generalization of over-parameterized neural networks have been studied extensively [2, 5, 15]. Over-parameterized neural networks appear in several RL research directions. One line of work proves the global optimality of RL algorithms in a nonlinear-approximation setting [31, 51, 62], using over-parameterized ReLU networks with policy gradient methods such as NPG and PPO; our work also belongs to this category. Other works include [19], which deploys a two-layer over-parameterized ReLU network for the mean-field multi-agent reinforcement learning problem. Regularization with over-parameterized neural networks has also been investigated recently [25, 41].
