Abstract
Existing risk-averse reinforcement learning approaches still face several challenges, including the lack of global optimality guarantees and the necessity of learning from long-term consecutive trajectories. Such trajectories are prone to visiting hazardous states, a major concern in the risk-averse setting. This paper proposes Transition-based vOlatility-controlled Policy Search (TOPS), a novel algorithm that solves risk-averse problems by learning from transitions. We prove that our algorithm, under the over-parameterized neural network regime, finds a globally optimal policy at a sublinear rate with proximal policy optimization and natural policy gradient. The convergence rate is comparable to that of state-of-the-art risk-neutral policy-search methods. The algorithm is evaluated on challenging MuJoCo robot simulation tasks under the mean-variance evaluation metric. Both theoretical analysis and experimental results demonstrate that TOPS achieves state-of-the-art performance among existing risk-averse policy search methods.
References
Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G.: On the theory of policy gradient methods: optimality, approximation, and distribution shift. J. Mach. Learn. Res. 22(98), 1–76 (2021)
Allen-Zhu, Z., Li, Y., Liang, Y.: Learning and generalization in overparameterized neural networks, going beyond two layers. In: Advances in Neural Information Processing Systems 32 (2019)
Allen-Zhu, Z., Li, Y., Song, Z.: A convergence theory for deep learning via over-parameterization (2019)
Antos, A., Szepesvári, C., Munos, R.: Fitted Q-iteration in continuous action-space MDPs. In: Advances in Neural Information Processing Systems 20 (2007)
Arora, S., Du, S., Hu, W., Li, Z., Wang, R.: Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In: International Conference on Machine Learning, pp. 322–332. PMLR (2019)
Bhandari, J., Russo, D.: Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786 (2019)
Bisi, L., Sabbioni, L., Vittori, E., Papini, M., Restelli, M.: Risk-averse trust region optimization for reward-volatility reduction. In: Bessiere, C. (ed.) Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-2020, pp. 4583–4589. International Joint Conferences on Artificial Intelligence Organization, July 2020. Special Track on AI in FinTech
Brockman, G., et al.: OpenAI gym. arXiv preprint arXiv:1606.01540 (2016)
Cai, Q., Yang, Z., Lee, J.D., Wang, Z.: Neural temporal-difference and Q-learning provably converge to global optima. arXiv preprint arXiv:1905.10027 (2019)
Cao, Y., Gu, Q.: Generalization error bounds of gradient descent for learning over-parameterized deep ReLU networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 3349–3356 (2020)
Cen, S., Cheng, C., Chen, Y., Wei, Y., Chi, Y.: Fast global convergence of natural policy gradient methods with entropy regularization. Oper. Res. 70(4), 2563–2578 (2021)
Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, Cambridge (2011)
Dabney, W., et al.: A distributional code for value in dopamine-based reinforcement learning. Nature 577(7792), 671–675 (2020)
Di Castro, D., Tamar, A., Mannor, S.: Policy gradients with variance related risk criteria. arXiv preprint arXiv:1206.6404 (2012)
Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 (2018)
Farahmand, A.M., Ghavamzadeh, M., Szepesvári, C., Mannor, S.: Regularized policy iteration with nonparametric function spaces. J. Mach. Learn. Res. 17(1), 4809–4874 (2016)
Fu, Z., Yang, Z., Wang, Z.: Single-timescale actor-critic provably finds globally optimal policy. arXiv preprint arXiv:2008.00483 (2020)
Garcıa, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16(1), 1437–1480 (2015)
Gu, H., Guo, X., Wei, X., Xu, R.: Mean-field multi-agent reinforcement learning: a decentralized network approach. arXiv preprint arXiv:2108.02731 (2021)
Hans, A., Schneegaß, D., Schäfer, A.M., Udluft, S.: Safe exploration for reinforcement learning. In: ESANN, pp. 143–148. Citeseer (2008)
Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proceedings of the 19th International Conference on Machine Learning. Citeseer (2002)
Kakade, S.M.: A natural policy gradient. In: Advances in Neural Information Processing Systems 14 (2001)
Konstantopoulos, T., Zerakidze, Z., Sokhadze, G.: Radon-Nikodým theorem. In: Lovric, M. (ed.) International Encyclopedia of Statistical Science, pp. 1161–1164. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-04898-2_468
Kovács, B.: Safe reinforcement learning in long-horizon partially observable environments (2020)
Kubo, M., Banno, R., Manabe, H., Minoji, M.: Implicit regularization in over-parameterized neural networks. arXiv preprint arXiv:1903.01997 (2019)
Prashanth, L.A., Ghavamzadeh, M.: Actor-critic algorithms for risk-sensitive MDPs. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 26. Curran Associates, Inc. (2013)
Lai, T.L., Xing, H., Chen, Z.: Mean-variance portfolio optimization when means and covariances are unknown. Ann. Appl. Stat. 5(2A), June 2011. https://doi.org/10.1214/10-aoas422
Laroche, R., Tachet des Combes, R.: Dr Jekyll and Mr Hyde: the strange case of off-policy policy updates. In: Advances in Neural Information Processing Systems 34 (2021)
Li, D., Ng, W.L.: Optimal dynamic portfolio selection: multiperiod mean-variance formulation. Math. Financ. 10(3), 387–406 (2000)
Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., Petrik, M.: Finite-sample analysis of proximal gradient TD algorithms. In: Proceedings of the Conference on Uncertainty in AI (UAI), pp. 504–513 (2015)
Liu, B., Cai, Q., Yang, Z., Wang, Z.: Neural trust region/proximal policy optimization attains globally optimal policy. In: Advances in Neural Information Processing Systems 32 (2019)
Majumdar, A., Pavone, M.: How should a robot assess risk? Towards an axiomatic theory of risk in robotics. In: Amato, N.M., Hager, G., Thomas, S., Torres-Torriti, M. (eds.) Robotics Research. SPAR, vol. 10, pp. 75–84. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-28619-4_10
Mannor, S., Tsitsiklis, J.: Mean-variance optimization in Markov decision processes. arXiv preprint arXiv:1104.5601 (2011)
Markowitz, H.M., Todd, G.P.: Mean-Variance Analysis in Portfolio Choice and Capital Markets, vol. 66. Wiley, New York (2000)
Mei, J., Xiao, C., Szepesvari, C., Schuurmans, D.: On the global convergence rates of softmax policy gradient methods. In: International Conference on Machine Learning, pp. 6820–6829. PMLR (2020)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Munos, R.: Performance bounds in Lp-norm for approximate value iteration. SIAM J. Control. Optim. 46(2), 541–561 (2007)
Munos, R., Szepesvári, C.: Finite-time bounds for fitted value iteration. J. Mach. Learn. Res. 9(5), 815–857 (2008)
Parker, D.: Managing risk in healthcare: understanding your safety culture using the Manchester patient safety framework (MaPSaF). J. Nurs. Manag. 17(2), 218–222 (2009)
Rahimi, A., Recht, B.: Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In: Advances in Neural Information Processing Systems 21 (2008)
Satpathi, S., Gupta, H., Liang, S., Srikant, R.: The role of regularization in overparameterized neural networks. In: 2020 59th IEEE Conference on Decision and Control (CDC), pp. 4683–4688. IEEE (2020)
Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897. PMLR (2015)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Shani, L., Efroni, Y., Mannor, S.: Adaptive trust region policy optimization: global convergence and faster rates for regularized MDPs. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 5668–5675 (2020)
Sobel, M.J.: The variance of discounted Markov decision processes. J. Appl. Probab. 19(4), 794–802 (1982)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. A Bradford Book. MIT Press, Cambridge (2018)
Sutton, R.S., et al.: Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: International Conference on Machine Learning, pp. 993–1000 (2009)
Thomas, G., Luo, Y., Ma, T.: Safe reinforcement learning by imagining the near future. In: Advances in Neural Information Processing Systems 34 (2021)
Todorov, E., Erez, T., Tassa, Y.: MuJoCo: a physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE (2012)
Vinyals, O., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019)
Wang, L., Cai, Q., Yang, Z., Wang, Z.: Neural policy gradient methods: global optimality and rates of convergence (2019)
Wang, M., Fang, E.X., Liu, H.: Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Math. Program. 161(1–2), 419–449 (2017)
Wang, W.Y., Li, J., He, X.: Deep reinforcement learning for NLP. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pp. 19–21 (2018)
Weng, J., Duburcq, A., You, K., Chen, H.: MuJoCo benchmark (2020). https://tianshou.readthedocs.io/en/master/tutorials/benchmark.html
Xie, T., et al.: A block coordinate ascent algorithm for mean-variance optimization. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018). https://proceedings.neurips.cc/paper/2018/file/4e4b5fbbbb602b6d35bea8460aa8f8e5-Paper.pdf
Xu, P., Chen, J., Zou, D., Gu, Q.: Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In: Advances in Neural Information Processing Systems (2018)
Xu, T., Liang, Y., Lan, G.: CRPO: a new approach for safe reinforcement learning with convergence guarantee. In: International Conference on Machine Learning, pp. 11480–11491. PMLR (2021)
Yang, L., Wang, M.: Reinforcement learning in feature space: matrix bandit, kernels, and regret bound. In: International Conference on Machine Learning, pp. 10746–10756. PMLR (2020)
Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning (still) requires rethinking generalization. Commun. ACM 64(3), 107–115 (2021)
Zhang, S., Liu, B., Whiteson, S.: Mean-variance policy iteration for risk-averse reinforcement learning. In: AAAI Conference on Artificial Intelligence (AAAI) (2021)
Zhang, S., Tachet, R., Laroche, R.: Global optimality and finite sample analysis of softmax off-policy actor critic under state distribution mismatch. arXiv preprint arXiv:2111.02997 (2021)
Zhong, H., Fang, E.X., Yang, Z., Wang, Z.: Risk-sensitive deep RL: variance-constrained actor-critic provably finds globally optimal policy (2020)
Zou, D., Cao, Y., Zhou, D., Gu, Q.: Gradient descent optimizes over-parameterized deep ReLU networks. Mach. Learn. 109(3), 467–492 (2020)
Acknowledgment
BL’s research is funded by the National Science Foundation (NSF) under grant NSF IIS1910794, an Amazon Research Award, and an Adobe gift fund.
Appendices
A Notation Systems
- \((\mathcal {S}, \mathcal {A}, \mathcal {P}, r,\gamma )\) is the MDP, with state space \(\mathcal {S}\), action space \(\mathcal {A}\), transition kernel \(\mathcal {P}\), reward function r, initial state \(S_0\) with distribution \(\mu _{0}\), and discount factor \(\gamma \).
- \(r_{\max } > 0\) is a constant upper bound on the reward.
- State value function \(V_{\pi }(s)\) and state-action value function \(Q_{\pi }(s,a)\).
- The normalized state and state-action occupancy measures of policy \(\pi \) are denoted by \(\nu _\pi (s)\) and \(\sigma _\pi (s,a)\), respectively.
- T is the length of a trajectory.
- The return is denoted by G; \(J(\pi )\) is the expectation of G.
- Policy \(\pi _\theta \) is parameterized by the parameter \(\theta \).
- \(\tau \) is the temperature parameter in the softmax parameterization of the policy.
- \(F(\theta )\) is the Fisher information matrix.
- \(\eta _{\textrm{TD}}\), \(\eta _{\textrm{NPG}}\), and \(\eta _{\textrm{PPO}}\) are the learning rates of the TD, NPG, and PPO updates, respectively.
- \(\beta \) is the penalty factor of the KL divergence in the PPO update.
- \(f\big ((s,a);\theta \big )\) is the two-layer over-parameterized neural network, with m as its width.
- \(\phi _\theta \) is the feature mapping of the neural network.
- \(\mathcal {D}\) is the parameter space for \(\theta \), with \(\varUpsilon \) as its radius.
- \(M >0\) is a constant upper bound on the initialization of \(\theta \).
- \(J^G_\lambda (\pi )\) is the mean-variance objective function.
- \(J_\lambda (\pi )\) is the reward-volatility objective function, with \(\lambda \) as the penalty factor.
- \(J_\lambda ^y(\pi )\) is the transformed reward-volatility objective function, with y as the auxiliary variable.
- \(\tilde{r}\) is the reward of the augmented MDP. Similarly, \(\tilde{V}_\pi (s)\) and \(\tilde{Q}_\pi (s,a)\) are the state value and state-action value functions of the augmented MDP, respectively, and \(\tilde{J}(\pi )\) is the risk-neutral objective of the augmented MDP.
- \(\hat{y}_{k}\) is an estimator of y at the k-th iteration.
- \(\omega \) is the parameter of the critic network.
- \(\delta _k=\text {argmin}_{\delta \in \mathcal {D}}\Vert \hat{F}(\theta _k)\delta -\tau _k\hat{\nabla }_\theta J(\pi _{\theta _k} )\Vert _2\).
- \(\xi _k(\delta )=\hat{F}(\theta _k)\delta -\tau _k\hat{\nabla }_\theta \tilde{J}(\pi _{\theta _k})-\mathbb {E}[\hat{F}(\theta _k)\delta -\tau _k\hat{\nabla }_\theta \tilde{J}(\pi _{\theta _k} )]\).
- \(\sigma _\xi \) is a constant associated with the upper bound of the gradient variance.
- \(\varphi _k,\psi _k,\varphi '_k,\psi '_k\) are the concentrability coefficients, upper bounded by a constant \(c_0 > 0\).
- \(\varphi ^*_{k} = \mathbb {E}_{(s,a) \sim \sigma _\pi }\bigg [\big (\frac{d\pi ^*}{d\pi _0}-\frac{d\pi _{\theta _k}}{d\pi _0}\big )^2\bigg ]^{1/2}\).
- \(\psi ^*_{k} = \mathbb {E}_{(s,a) \sim \sigma _\pi }\bigg [\big (\frac{d\sigma _{\pi ^*}}{d\sigma _\pi }-\frac{d\nu _{\pi ^*}}{d\nu _\pi }\big )^2\bigg ]^{1/2}\).
- K is the total number of iterations; \(K_\textrm{TD}\) is the total number of TD iterations.
- \(c_3>0\) is a constant quantifying the difference in the risk-neutral objective between the optimal policy and any policy.
B Algorithm Details
We provide a comparison between MVPI and TOPS. Note that neither NPG nor PPO solves \(\theta _{k}:=\arg \max _\theta \tilde{J}(\pi _{\theta })\) directly; instead, each solves an approximate optimization problem at every iteration. We provide pseudo-code for the implementation of MVPI and VARAC in Algorithm 3 and .
C Experimental Details
Note that although the mean-volatility method can be adapted to off-policy methods [60], in this paper our proposed method is an on-policy actor-critic algorithm, for ease of the theoretical analysis.
C.1 Testbeds
We use six MuJoCo tasks from OpenAI Gym [8] as testbeds: HalfCheetah-v2, Hopper-v2, Swimmer-v2, Walker2d-v2, InvertedPendulum-v2, and InvertedDoublePendulum-v2.
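The evaluation protocol can be sketched with the classic Gym rollout loop. The helper below is ours, not from the paper: `episode_return` is a hypothetical name, and any object exposing the old-style `reset`/`step` API can stand in for the MuJoCo tasks listed above.

```python
# The six MuJoCo task IDs used as testbeds in C.1.
MUJOCO_TASKS = [
    "HalfCheetah-v2", "Hopper-v2", "Swimmer-v2",
    "Walker2d-v2", "InvertedPendulum-v2", "InvertedDoublePendulum-v2",
]

def episode_return(env, policy, gamma=0.99):
    """Roll out one episode and accumulate the discounted return G.

    `env` follows the classic Gym API: reset() -> state,
    step(a) -> (state, reward, done, info).
    """
    s, done, g, t = env.reset(), False, 0.0, 0
    while not done:
        s, r, done, _ = env.step(policy(s))
        g += (gamma ** t) * r
        t += 1
    return g
```

With a real setup one would pass `gym.make(task_id)` for each ID in `MUJOCO_TASKS`.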
C.2 Hyper-parameter Settings
In the experiments we set \(\lambda = 1\). We then tune the learning rates for the different algorithms. For MVP, we use the same setting as [60]. For MVPI, TOPS, and VARAC with neural NPG, we tune the learning rate of the actor network over \(\{0.1, 1\times 10^{-2}, 1\times 10^{-3}, 7\times 10^{-4}\}\) and that of the critic network over \(\{1\times 10^{-2}, 1\times 10^{-3}, 7\times 10^{-4}\}\). For MVPI, TOPS, and VARAC with neural PPO, we tune the learning rate of the actor network over \(\{3\times 10^{-3}, 3\times 10^{-4}, 3\times 10^{-5}\}\) and that of the critic network over \(\{1\times 10^{-2}, 1\times 10^{-3}, 1\times 10^{-4}\}\).
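The tuning described above amounts to a small Cartesian product over (actor, critic) learning-rate pairs. The sketch below uses hypothetical helper names; the grids are copied from the text.

```python
from itertools import product

# Learning-rate grids from C.2 (lambda is fixed at 1).
NPG_GRID = {
    "actor":  [0.1, 1e-2, 1e-3, 7e-4],
    "critic": [1e-2, 1e-3, 7e-4],
}
PPO_GRID = {
    "actor":  [3e-3, 3e-4, 3e-5],
    "critic": [1e-2, 1e-3, 1e-4],
}

def grid(cfg):
    """Enumerate every (actor_lr, critic_lr) combination of a grid."""
    return list(product(cfg["actor"], cfg["critic"]))
```

Each pair would then be run and the best configuration kept per task.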
C.3 Computing Infrastructure
We conducted our experiments on NVIDIA GTX 970 and GTX 1080 Ti GPUs.
D Theoretical Analysis Details
In this section, we discuss the theoretical analysis in detail. We first present an overview in Sect. D.1, then provide additional assumptions in Sect. D.2. In the rest of the section, we present all the supporting lemmas and the proofs of Theorems 1 and 2.
D.1 Overview
We provide Fig. 5 to illustrate the structure of the theoretical analysis. First, under Assumptions 3 and 4, together with Lemma 13, we obtain Lemmas 14, 15, and 16. These are the building blocks of Lemma 2, which is a shared component in the analysis of both NPG and PPO. The shared components also include Lemma 3, as well as Lemma 4, which is obtained under Assumption 5. For the PPO analysis, under Assumptions 2 and 4, we obtain Lemmas 7 and 8 from Lemmas 2 and 6. Then, combined with Lemmas 3, 4, and 9, we obtain Theorem 1, the major result of the PPO analysis. Similarly, for the NPG analysis, we first obtain Lemmas 11 and 12 under Assumptions 1, 2, and 4. Together with Lemmas 2, 3, 4, and 10, we then obtain Theorem 2, the major result of the NPG analysis.
D.2 Additional Assumptions
Assumption 3
(Action-value function class). We define
where \(\mu :\mathbb {R}^d \rightarrow [0,1] \) is the probability density function of \(\mathcal {N}(0,I_d/d)\), \(f_0(s,a)\) is the two-layer neural network corresponding to the initial parameter \(\varTheta _{\textrm{init}}\), and \(\iota :\mathbb {R}^d \rightarrow \mathbb {R}^d \) is a weight function. We assume that \(\tilde{Q}_\pi \in \mathcal {F}_{\varUpsilon ,\infty }\) for all \(\pi \).
Assumption 4
(Regularity of stationary distribution). For any policy \(\pi \), any \(x \in \mathbb {R}^d\) with \(\Vert x\Vert _2=1\), and any \(u>0\), we assume that there exists a constant \(c > 0\) such that \( \mathbb {E}_{(s,a) \sim \sigma _\pi }\big [\mathbbm {1}\{|x^\top (s,a)|\le u\}\big ]\le c u. \)
Assumption 3 is a mild regularity condition on \(Q_\pi \), as \(\mathcal {F}_{\varUpsilon ,\infty }\) is a sufficiently rich function class and approximates a subset of the reproducing kernel Hilbert space (RKHS) [40]. Similar assumptions are widely imposed [4, 16, 38, 51, 58]. Assumption 4 is a regularity condition on the transition kernel \(\mathcal {P}\). Such regularity holds as long as \(\sigma _\pi \) admits an upper-bounded density, which is satisfied by most Markov chains.
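As an illustration (ours, not part of the paper's argument), Assumption 4 can be checked numerically when the feature vector \((s,a)\) is Gaussian: for \(z\sim \mathcal {N}(0,I_d/d)\), the marginal \(x^\top z\) is \(\mathcal {N}(0,1/d)\) with density bounded by \(\sqrt{d/(2\pi )}\), so the indicator's expectation is at most \(c\,u\) with \(c = 2\sqrt{d/(2\pi )}\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200_000
z = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))  # stand-in for (s, a) features
x = rng.normal(size=d)
x /= np.linalg.norm(x)                               # unit direction, ||x||_2 = 1

c = 2 * np.sqrt(d / (2 * np.pi))                     # density bound gives this c
for u in [0.05, 0.1, 0.2]:
    p = np.mean(np.abs(z @ x) <= u)                  # E[1{|x^T (s,a)| <= u}]
    assert p <= c * u + 3 / np.sqrt(n)               # bound holds up to MC slack
```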
Lemma 4.15 of [62] contains a mistake in its proof: a sign in \(y^*-\bar{y}\) is accidentally flipped when transitioning from the first equation of the proof to Eq. (4.15). This invalidates the conclusion in Eq. (4.17), an essential part of the proof. We address this issue with the following assumption.
Assumption 5
(Convergence Rate of \(J(\pi )\)). We assume that, for both NPG and PPO with the over-parameterized neural network, the risk-neutral objective \(J(\pi )\) converges to that of \(\pi ^*\) (the optimal policy of the risk-averse objective \(J_\lambda (\pi )\)) at rate \(\mathcal {O}(1/\sqrt{k})\). Specifically, there exists a constant \(c_3>0\) such that,
It was proved in [31, 51] that NPG and PPO with an over-parameterized two-layer neural network converge to the globally optimal policy of the risk-neutral objective \(J(\pi )\) at a rate of \(\mathcal {O}(1/\sqrt{K})\), where K is the number of iterations. Since our method uses similar settings, we assume the convergence rate of the risk-neutral objective \(J(\pi )\) in our paper follows their results.
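In symbols, a natural form of the assumed bound (this display is our reading, written to match the \(c_3\) notation and the \(\mathcal {O}(1/\sqrt{k})\) rate, and the way Lemma 4 invokes Assumption 5) is:

```latex
J(\pi^*) \;-\; J(\pi_{\theta_k}) \;\le\; \frac{c_3}{\sqrt{k}}, \qquad k = 1, 2, \dots
```

Lemma 4 then converts a bound of this type into a bound on \(|y^*-\hat{y}_k|\), which contributes the \((1-\gamma )\) factor.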
In the following subsections, we study TOPS’s convergence of global optimality and provide a proof sketch.
D.3 Proof of Theorem 1
We first present the analysis of policy evaluation error, which is induced by TD update in Line 9 of Algorithm 1. We characterize the policy evaluation error in the following lemma:
Lemma 2
(Policy Evaluation Error). We set the learning rate of TD to \(\eta _{\text {TD}} = \min \{(1-\gamma )/3(1+\gamma )^2, 1/\sqrt{K_{\textrm{TD}}}\}\). Under Assumptions 3 and 4, it holds with probability \(1-\delta \) that,
where \(\tilde{Q}_{\pi _k}\) is the Q-value function of the augmented MDP, and \(\tilde{Q}_{\omega _k}\) is its estimator at the k-th iteration. We provide the proof and its supporting lemmas in Appendix D.6. In the following, we establish the error induced by the policy update. Equation (8) can be re-expressed as
It can be shown that \(\max _{y}J_\lambda ^y (\pi ) = J_\lambda (\pi ) \) for all \(\pi \) [55, 60]. We denote by \(\pi ^*(y^*)\) the optimal policy of the augmented MDP associated with \(y^*\). By definition, \(\pi ^*\) and \(\pi ^*(y^*)\) are equivalent; for simplicity, we use the unified term \(\pi ^*\) in the rest of the paper. We now present Lemmas 3 and 4.
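The identity \(\max _y J^y_\lambda (\pi ) = J_\lambda (\pi )\) is easy to check numerically if one takes the per-step (reward-volatility) form \(J^y_\lambda = \mathbb {E}[R] - \lambda \,\mathbb {E}[(R-y)^2]\), which is our reading of Eq. (15): since \(\mathbb {E}[(R-y)^2] = \mathbb {V}(R) + (\mathbb {E}[R]-y)^2\), the maximum over y is attained at \(y^* = \mathbb {E}[R]\).

```python
import numpy as np

rng = np.random.default_rng(1)
R = rng.normal(2.0, 1.5, size=100_000)     # samples of the per-step reward
lam = 1.0                                  # penalty factor lambda

mean, var = R.mean(), R.var()
target = mean - lam * var                  # reward-volatility objective J_lambda

# J_lambda^y = E[R] - lam * E[(R - y)^2], swept over a grid of y values.
ys = np.linspace(mean - 2.0, mean + 2.0, 401)
vals = np.array([mean - lam * np.mean((R - y) ** 2) for y in ys])

assert np.all(vals <= target + 1e-9)       # J^y never exceeds J_lambda
assert abs(vals.max() - target) < 1e-6     # and attains it at y* = E[R]
```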
Lemma 3
(Policy’s Performance Difference). Let \(J^y_\lambda (\pi )\) be the mean-volatility objective w.r.t. the auxiliary variable y, as defined in Eq. (15). For any policies \(\pi \) and \(\pi '\), we have the following,
where \(\tilde{Q}_{\pi ,y} \) is the state-action value function of the augmented MDP, and its rewards are associated with y.
Proof
When y is fixed,
We then follow Lemma 6.1 in [21]:
where \(\tilde{A}_\pi = \tilde{Q}_\pi - \tilde{V}_\pi \) is the advantage function of policy \(\pi \). Meanwhile,
From Eq. (16), Eq. (17) and Eq. (18), we complete the proof.
Lemma 3 is inspired by [21] and adopted by most work on global convergence [1, 31, 57]. Next, we derive an upper bound for the error of the critic update in Line 5 of Algorithm 1:
Lemma 4
(y Update Error). We characterize the error induced by estimating the auxiliary variable y w.r.t. its optimal value \(y^*\) at the k-th iteration as \( J^{y^*}_\lambda (\pi ^*)-J^{\hat{y}_k}_\lambda (\pi ^*) \le \frac{2c_3 r_{\max }(1-\gamma )\lambda }{\sqrt{k}}, \) where \(r_{\max }\) is the bound of the original reward and \(c_3\) is a constant error term.
Proof
We start from the subproblem objective defined in Eq. (15) with \(y^*\) and \(\hat{y}_k\):
where the final two equalities follow from the definitions of \(J_\pi \) and y. Because \(r_{s,a}\) is upper-bounded by the constant \(r_{\max }\), we have \(|y^*-\hat{y}_k | \le 2r_{\max }\). Under Assumption 5, we have,
This finishes the proof.
From Lemmas 3 and 4, we also obtain the following lemma.
Lemma 5
(Performance Difference on \(\pi \) and y). Let \(J^y_\lambda (\pi )\) be the mean-volatility objective w.r.t. the auxiliary variable y, as defined in Eq. (15). For any \(\pi ,y\) and the optimal \(\pi ^*,y^*\), we have the following,
where \(\tilde{Q}_{\pi ,y} \) is the state-action value function of the augmented MDP, and its rewards are associated with y.
Proof
It is easy to see that \(J^{y^*}_\lambda (\pi ^*) - J^y_\lambda (\pi ) = J^{y^*}_\lambda (\pi ^*) - J^y_\lambda (\pi ^*) + J^y_\lambda (\pi ^*) - J^{y}_\lambda (\pi )\). Replacing \(J^y_\lambda (\pi ^*) - J^{y}_\lambda (\pi )\) using Lemma 3 and \(J^{y^*}_\lambda (\pi ^*) - J^y_\lambda (\pi ^*)\) using Lemma 4 finishes the proof.
Lemma 5 quantifies the performance difference of \(J^{y}_\lambda (\pi )\) between any pair \((\pi , y)\) and the optimal \((\pi ^*, y^*)\), while Lemma 3 only quantifies the performance difference of \(J^{y}_\lambda (\pi )\) between \(\pi \) and \(\pi '\) when y is fixed.
We now study the global convergence of TOPS with neural PPO as the policy update component. First, we define the neural PPO update rule.
Lemma 6
[31]. Let \(\pi _{\theta _k} \propto \exp \{\tau ^{-1}_k f_{\theta _k}\}\) be an energy-based policy. We define the update
where \(Q_{\omega _k}\) is the estimator of the exact action-value function \(Q^{\pi _{\theta _k}}\). We have
To represent \(\hat{\pi }_{k+1}\) with \(\pi _{\theta _{k+1}} \propto \exp \{\tau ^{-1}_{k+1} f_{\theta _{k+1}}\}\), we solve the following subproblem,
We analyze the policy improvement error in Line 13 of Algorithm 1. It is proved in [31] that the policy improvement error can be characterized similarly to the policy evaluation error in Eq. (14). Recall that \(\tilde{Q}_{\omega _k}\) is the estimator of the Q-value, \(f_{\theta _k}\) is the energy function of the policy, and \(f_{\hat{\theta }}\) is its estimator. We characterize the policy improvement error as follows: under Assumptions 3 and 4, we set the learning rate of PPO to \(\eta _{\textrm{PPO}}=\min \{(1-\gamma )/3(1+\gamma )^2, 1/\sqrt{K_{\textrm{TD}}}\}\), and with probability \(1-\delta \):
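The KL-penalized subproblem behind the update in Lemma 6 has a closed-form solution proportional to \(\pi _{\theta _k}\exp \{\beta ^{-1}Q\}\). The sketch below is a discrete-action toy of ours (not the paper's neural implementation); it verifies that the exponentiated update maximizes \(\langle Q,\pi \rangle - \beta \,\mathrm {KL}(\pi \Vert \pi _{\theta _k})\) over the simplex.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with positive entries."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(2)
A = 5                                 # discrete action set (for illustration)
q = rng.normal(size=A)                # estimated action values Q(s, .)
pi_k = rng.dirichlet(np.ones(A))      # current policy pi_theta_k(. | s)
beta = 0.7                            # KL penalty factor

# Closed-form maximizer of  <Q, pi> - beta * KL(pi || pi_k):
pi_next = pi_k * np.exp(q / beta)
pi_next /= pi_next.sum()

def objective(p):
    return float(p @ q) - beta * kl(p, pi_k)

# pi_next should beat any other distribution on the simplex.
for _ in range(1000):
    p = rng.dirichlet(np.ones(A))
    assert objective(p) <= objective(pi_next) + 1e-9
```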
We quantify how the errors propagate in neural PPO [31] in the following.
Lemma 7
[31]. (Error Propagation) We have,
\(\varepsilon ''_{k}\) is defined in Eq. (14) as well as Eq. (19), and \(\varphi ^*_{k} = \mathbb {E}_{(s,a) \sim \sigma _\pi }\bigg [\big (\frac{d\pi ^*}{d\pi _0}-\frac{d\pi _{\theta _k}}{d\pi _0}\big )^2\bigg ]^{1/2}\), \(\psi ^*_{k} = \mathbb {E}_{(s,a) \sim \sigma _\pi }\bigg [\big (\frac{d\sigma _{\pi ^*}}{d\sigma _\pi }-\frac{d\nu _{\pi ^*}}{d\nu _\pi }\big )^2\bigg ]^{1/2}\), where \(\frac{d\pi ^*}{d\pi _0},\frac{d\pi _{\theta _k}}{d\pi _0},\frac{d\sigma _{\pi ^*}}{d\sigma _\pi },\frac{d\nu _{\pi ^*}}{d\nu _\pi }\) are Radon-Nikodym derivatives [23]. We denote the RHS of Eq. (20) by \(\varepsilon _k = \tau ^{-1}_{k+1}\varepsilon ''_{k}\varphi ^*_{k+1} + \beta ^{-1}\varepsilon ''_{k}\psi ^*_{k}\). Lemma 7 essentially quantifies the error incurred by approximating the action-value function and the policy with the two-layer neural network instead of having access to the exact ones. Please refer to [31] for the complete proofs of Lemmas 6 and 7.
We then characterize the difference between energy functions at each step [31]. Under the optimal policy \(\pi ^*\),
Lemma 8
[31]. (Stepwise Energy Function Difference) Under the same conditions as Lemma 7, we have
where \(\varepsilon '_k = |\mathcal {A}|\tau ^{-2}_{k+1}\epsilon ^2_{k+1}\) and \(U = 2\mathbb {E}_{s \sim \nu _{\pi ^*}}[\max _{a\in \mathcal {A}}(\tilde{Q}_{\omega _{0}})^2] + 2\varUpsilon ^2\).
Proof
By the triangle inequality, we get the following,
We take the expectation of both sides of Eq. (22) with respect to \(s\sim \nu _{\pi ^*}\). With the 1-Lipschitz continuity of \(\tilde{Q}_{\omega _k}\) in \(\omega \) and \(\Vert \omega _k-\varTheta _\textrm{init}\Vert _2 \le \varUpsilon \), we have,
This completes the proof.
We then derive a difference term associated with \(\pi _{k+1}\) and \(\pi _{\theta _k}\), where at the k-th iteration \(\pi _{k+1}\) is the solution to the following subproblem,
and \(\pi _{\theta _k}\) is the policy parameterized by the two-layer over-parameterized neural network. The following lemma establishes the one-step descent of the KL-divergence in the policy space:
Lemma 9
(One-step difference of \(\pi \)). For \(\pi _{k+1}\) and \(\pi _{\theta _k}\), we have
Proof
We start from
Recall that \(\pi _{k+1} \propto \exp \{\tau ^{-1}_k f_{\theta _k}+\beta ^{-1} \tilde{Q}^y_{\pi _k}\}\). We define the two normalization factors associated with the ideal improved policy \(\pi _{k+1}\) and the current parameterized policy \(\pi _{\theta _k}\) as,
We then have,
For any \(\pi , \pi '\) and k, we have,
Now we look back at a few terms on the RHS of Eq. (24):
For Eq. (29), we obtain the first equality by Eq. (26), the second equality by swapping Eq. (27) with Eq. (28), and the concluding step by the definition in Eq. (25). Following a similar logic, we have,
Finally, by Pinsker’s inequality [12], we have,
Plugging Eqs. (29), (30), and (31) into Eq. (24), we have
Rearranging the terms, we obtain Lemma 9.
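Pinsker’s inequality, used above, bounds the \(\ell _1\) (total-variation) distance by the KL divergence, \(\Vert p - q\Vert _1^2 \le 2\,\mathrm {KL}(p\Vert q)\). A quick numerical check on random discrete distributions:

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(500):
    p = rng.dirichlet(np.ones(6))           # random distribution p
    q = rng.dirichlet(np.ones(6))           # random distribution q
    l1 = float(np.abs(p - q).sum())         # ||p - q||_1
    kl = float(np.sum(p * np.log(p / q)))   # KL(p || q)
    assert l1 ** 2 <= 2.0 * kl + 1e-12      # Pinsker: ||p - q||_1^2 <= 2 KL
```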
Lemma 9 serves as an intermediate term for the proof of the major result. We obtain upper bounds by telescoping this term in Theorem 1. Now we are ready to present the proof of Theorem 1.
Proof
First, we take the expectation of both sides of Eq. (23) from Lemma 9 with respect to \(s\sim \nu _{\pi ^*}\) and insert Eq. (20) to obtain,
Then, by Lemma 3, we have,
And with Hölder’s inequality, we have,
Inserting Eqs. (33) and (34) into Eq. (32), we have,
The second inequality holds by the inequality \(2AB - B^2\le A^2\), with a minor abuse of notation. Here, \(A := \Vert \tau ^{-1}_{k+1} f_{\theta _{k+1}}- \tau ^{-1}_{k} f_{\theta _{k}}\Vert _{\infty }\) and \(B := \Vert \pi _{\theta _{k}}-\pi _{\theta _{k+1}}\Vert _1\). Then, by plugging in Lemma 4 and Eq. (21), we end up with,
Rearranging Eq. (35), we have
Telescoping Eq. (36) then results in,
We complete the final step in Eq. (37) by plugging in Lemma 4 and Eq. (20). Per the observation we make in the proof of Theorem 2,
1. \(\mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{0})] \le \log |\mathcal {A}|\), due to the uniform initialization of the policy.
2. \(\text {KL}(\pi ^*\Vert \pi _{K})\) is a non-negative term.
We now have,
Replacing \(\beta \) with \(\beta _0\sqrt{K}\) finishes the proof.
D.4 Proof of Theorem 2
In this part, we focus on the convergence of neural NPG. We first define the following terms under the neural NPG update rule.
Lemma 10
[51]. For an energy-based policy \(\pi _\theta \), the policy gradient and Fisher information matrix are,
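For intuition, the tabular softmax case makes Lemma 10 concrete: the score is \(\nabla _\theta \log \pi _\theta (a) = e_a - \pi _\theta \), the Fisher matrix is the score's second moment under \(\pi _\theta \), and an NPG step preconditions the gradient by \(F^{+}\). This single-state toy is ours (not the paper's neural setting); it checks the zero-mean score, the PSD Fisher matrix, and that the NPG step does not decrease the expected reward.

```python
import numpy as np

rng = np.random.default_rng(4)
A = 4
theta = rng.normal(size=A)
r = rng.normal(size=A)                    # action rewards in a one-state toy MDP

pi = np.exp(theta - theta.max()); pi /= pi.sum()

# Score of a tabular softmax policy: grad log pi(a) = e_a - pi (row a below).
score = np.eye(A) - pi

grad_J = (pi[:, None] * score * r[:, None]).sum(axis=0)   # E[r * grad log pi]
F = (pi[:, None, None] * score[:, :, None] * score[:, None, :]).sum(axis=0)

assert np.allclose(pi @ score, 0.0)                # E[grad log pi] = 0
assert np.allclose(F, F.T)                         # Fisher matrix is symmetric
assert np.all(np.linalg.eigvalsh(F) >= -1e-12)     # and positive semi-definite

# Natural gradient step (pseudo-inverse: F is singular along the 1-direction).
npg_dir = np.linalg.pinv(F) @ grad_J
theta2 = theta + 0.5 * npg_dir
pi2 = np.exp(theta2 - theta2.max()); pi2 /= pi2.sum()
assert pi2 @ r >= pi @ r - 1e-12                   # expected reward improves
```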
We then derive an upper bound for \(J^{y^*}_{\lambda }(\pi ^*)-J^{y^*}_{\lambda }(\pi _k)\) for the neural NPG method in the following lemma:
Lemma 11
(One-step difference of \(\pi \)). It holds that, with probability \(1-\delta \),
\(c_0\) is defined in Assumption 2 and \(\sigma _\xi \) is defined in Assumption 1. Meanwhile, \(\varUpsilon \) is the radius of the parameter space, m is the width of the neural network, and T is the sample batch size.
Proof
We start from the following,
We now show the building blocks of the proof. First, we add and subtract a few terms on the RHS of Eq. (38), then take the expectation of both sides with respect to \(s\sim \nu _{\pi ^*}\). Rearranging these terms, we get,
where \(H_k\) is defined as,
By Lemma 3, we have
Inserting Eq. (41) back into Eq. (39), we have,
We reach the final inequality of Eq. (42) by algebraic manipulation. Second, we follow Lemma 5.5 of [51] and obtain an upper bound for Eq. (40). Specifically, with probability \(1-\delta \),
The expectation is taken over all randomness. With these building blocks, Eqs. (42) and (43), we are now ready to reach the concluding inequality. Plugging Eq. (43) back into Eq. (42), we end up with, with probability \(1-\delta \),
Dividing both sides of Eq. (44) by \(\eta _{\textrm{NPG}}\) completes the proof. The details are included in the Appendix.
We have the following lemma to bound the error term \(H_k\) defined in Eq. (40) of Lemma 11.
Lemma 12
[51]. Under Assumption 4, we have
Here the expectation is taken over all the randomness. We have \(\epsilon '_{k}:=\Vert Q_{\omega _k}-Q_{\pi _k}\Vert ^2_{\nu _{\pi _k}}\) and
Recall that \(\xi _k(\delta )\) is defined in Assumption 1, while \(\varphi _k\), \(\psi _k\), \(\varphi '_k\), and \(\psi '_k\) are defined in Assumption 2.
Please refer to [51] for the complete proof. Finally, we are ready to present the proof of Theorem 2.
Proof
First, we combine Lemmas 4 and 11 to obtain the following:
We then observe the following:
1. \(\mathbb {E}_{s \sim \nu _{\pi ^*}}[\text {KL}(\pi ^*\Vert \pi _{1})] \le \log |\mathcal {A}|\) due to the uniform initialization of policy.
2. \(\text {KL}(\pi ^*\Vert \pi _{K+1})\) is a non-negative term.
By setting \(\eta _{\textrm{NPG}}=1/\sqrt{K}\) and telescoping Eq. (45), we obtain,
Plugging \(\epsilon '_{k}\) and \(\epsilon ''_{k}\), defined in Lemma 12, into Eq. (46) and setting \(\epsilon _k\) as,
we complete the proof.
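Concretely, the telescoping step is a restatement of the sum just taken: summing Eq. (45) over \(k=1,\dots ,K\), the KL terms collapse as
\[
\sum _{k=1}^{K}\mathbb {E}_{s\sim \nu _{\pi ^*}}\big [\text {KL}(\pi ^*\Vert \pi _{k})-\text {KL}(\pi ^*\Vert \pi _{k+1})\big ]
=\mathbb {E}_{s\sim \nu _{\pi ^*}}\big [\text {KL}(\pi ^*\Vert \pi _{1})\big ]-\mathbb {E}_{s\sim \nu _{\pi ^*}}\big [\text {KL}(\pi ^*\Vert \pi _{K+1})\big ]
\le \log |\mathcal {A}|,
\]
where the inequality applies the two observations above.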
D.5 Proof of Lemma 1
Proof
First, we have \(\mathbb {E}[G] = \frac{1}{1-\gamma }\mathbb {E}[R]\), i.e., the scaled per-step reward \(\frac{1}{1-\gamma }R\) is an unbiased estimator of the cumulative reward G. Second, it has been proved that \(\mathbb {V}(G) \le \frac{\mathbb {V}(R)}{(1-\gamma )^2}\) [7]. Given \(\lambda \ge 0\), summing up the above equality and inequality, we have
This completes the proof.
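The two facts used in this proof can be sanity-checked numerically. Below is a minimal Monte-Carlo sketch (not part of the original proof) under the illustrative assumptions of i.i.d. Uniform(0, 1) per-step rewards and a truncated horizon:

```python
import random

def simulate_returns(gamma=0.9, horizon=200, episodes=20000, seed=0):
    # Illustrative sketch: i.i.d. per-step rewards R ~ Uniform(0, 1);
    # G = sum_t gamma^t R_t, truncated at `horizon` (gamma^200 is negligible).
    rng = random.Random(seed)
    returns = []
    for _ in range(episodes):
        returns.append(sum((gamma ** t) * rng.random() for t in range(horizon)))
    mean_g = sum(returns) / episodes
    var_g = sum((g - mean_g) ** 2 for g in returns) / episodes
    return mean_g, var_g

gamma = 0.9
mean_r, var_r = 0.5, 1.0 / 12.0   # mean and variance of Uniform(0, 1)
mean_g, var_g = simulate_returns(gamma)
```

The estimate `mean_g` matches \(\mathbb {E}[R]/(1-\gamma )\), and `var_g` satisfies the bound; in the i.i.d. case the bound even holds with slack, since \(\mathbb {V}(G)=\mathbb {V}(R)/(1-\gamma ^2)\).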
D.6 Proof of Lemma 2
We first provide the supporting lemmas for Lemma 2. We define the local linearization of \(f((s,a);\theta )\), given in Eq. (4), at the initial point \(\varTheta _{\textrm{init}}\) as,
We then define the following function spaces,
and
\([\varTheta _{\textrm{init}}]_r\sim \mathcal {N}(0,I_d/d)\) and \(b_r\sim \text {Unif}(\{-1,1\})\) are the initial parameters. By definition, \(\bar{\mathcal {F}}_{\varUpsilon ,m}\) is a subset of \(\mathcal {F}_{\varUpsilon ,m}\). The following lemma characterizes the deviation of \(\bar{\mathcal {F}}_{\varUpsilon ,m}\) from \(\mathcal {F}_{\varUpsilon ,\infty }\).
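To make the construction concrete, here is a minimal Python sketch (illustrative, not the paper's implementation; all identifiers are our own) of the two-layer ReLU network of Eq. (4) and its local linearization at \(\varTheta _{\textrm{init}}\). At \(\theta =\varTheta _{\textrm{init}}\) the two coincide, since \(\text {relu}(z)=\mathbbm {1}\{z>0\}\,z\):

```python
import math
import random

def init_params(m, d, seed=0):
    # Theta_init: each row drawn from N(0, I_d / d); b_r ~ Unif({-1, +1}).
    rng = random.Random(seed)
    theta = [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)] for _ in range(m)]
    b = [rng.choice([-1.0, 1.0]) for _ in range(m)]
    return theta, b

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def f(x, theta, b):
    # Two-layer ReLU network: f(x; theta) = (1/sqrt(m)) sum_r b_r relu(theta_r . x).
    m = len(theta)
    return sum(br * max(dot(tr, x), 0.0) for tr, br in zip(theta, b)) / math.sqrt(m)

def f_hat(x, theta, b, theta_init):
    # Local linearization at Theta_init: the ReLU activation pattern is frozen
    # at the initial parameters, so f_hat is linear in theta.
    m = len(theta)
    return sum(
        br * (1.0 if dot(t0, x) > 0.0 else 0.0) * dot(tr, x)
        for tr, t0, br in zip(theta, theta_init, b)
    ) / math.sqrt(m)
```

For \(\theta \) close to \(\varTheta _{\textrm{init}}\), the two functions differ only on neurons whose activation sign flips; this is precisely the source of the linearization error bounded in Lemma 14 below.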
Lemma 13
(Projection Error) [40]. Let \(f\in \mathcal {F}_{\varUpsilon ,\infty }\), where \(\mathcal {F}_{\varUpsilon ,\infty }\) is defined in Assumption 3. For any \(\delta >0\), it holds with probability at least \(1-\delta \) that
where \(\varsigma \) is any distribution over \(S \times A\).
Please refer to [40] for a detailed proof.
Lemma 14
(Linearization Error). Under Assumption 4, for all \(\theta \in \mathcal {D}\), where \(\mathcal {D} = \{\xi \in \mathbb {R}^{md}:\Vert \xi -\varTheta _{\text {init}} \Vert _2 \le \varUpsilon \}\), it holds that,
where \(c_1 = c\sqrt{\mathbb {E}_{\mathcal {N}(0,I_d/d)}[1/\Vert (s,a)\Vert _2^2]}\), and c is defined in Assumption 4.
Proof
We start from the definitions in Eq. (4) and Eq. (47),
The above inequality holds by the fact that \(\vert \sum W \vert \le \sum \vert W \vert \), where \(W = \big ((\mathbbm {1}\{[\theta ]_v^\top (s,a)>0\} - \mathbbm {1}\{[\varTheta _{\textrm{init}}]_v^\top (s,a)>0\}) b_v [\theta ]_v^\top (s,a)\big )\), and \(\varTheta _{\textrm{init}}\) is defined in Eq. (5). Next, whenever \(\mathbbm {1}\{[\varTheta _{\textrm{init}}]_v^\top (s,a)>0\} \ne \mathbbm {1}\{[\theta ]_v^\top (s,a)>0\}\), we have,
where we obtain the last inequality from the Cauchy-Schwarz inequality. We also assume that \(\Vert (s,a)\Vert _2 \le 1\) without loss of generality [31, 51]. Equation (49) further implies that,
Then, plugging Eq. (50) and the fact that \(|b_v|\le 1\) back into Eq. (48), we have,
We obtain the second inequality by the fact that \(|A|\le |A-B|+|B|\). Then, following the Cauchy-Schwarz inequality and \(\Vert (s,a)\Vert _2 \le 1\), we have the third inequality. By inserting Eq. (49) we achieve the fourth inequality. We continue Eq. (51) by following the Cauchy-Schwarz inequality and plugging in \(\big \Vert [\theta ] - [\varTheta _{\textrm{init}}]\big \Vert _2 \le \varUpsilon \),
We obtain the second inequality by imposing Assumption 4 and the third by following the Cauchy-Schwarz inequality. Finally, we set \(c_1 := c\sqrt{\mathbb {E}_{\mathcal {N}(0,I_d/d)}[1/\Vert (s,a)\Vert _2^2]} \), which completes the proof.
In the t-th TD iteration, we denote the temporal difference terms with respect to \(\hat{f}((s,a);\theta _t)\) and \(f((s,a);\theta _t)\) as
For notational simplicity, in the sequel we write \(\delta _t^0((s,a),(s,a)';\theta _t)\) and \(\delta _t^\theta ((s,a),(s,a)';\theta _t)\) as \(\delta _t^0\) and \(\delta _t^\theta \). We further define the stochastic semi-gradient \(g_t(\theta _t):=\delta _t^\theta \nabla _{\theta } f((s,a);\theta _t)\) and its population mean \(\bar{g}_t(\theta _t):=\mathbb {E}_{\nu _\pi }[g_t(\theta _t)]\). The local linearization of \(\bar{g}_t(\theta _t)\) is \(\hat{g}_t(\theta _t):=\mathbb {E}_{\nu _\pi }[\delta _t^0 \nabla _{\theta } \hat{f}((s,a);\theta _t)]\). We denote them as \(g_t\), \(\bar{g}_t\), and \(\hat{g}_t\), respectively, for simplicity.
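The stochastic semi-gradient admits a simple closed form, since per neuron \(\nabla _\theta f\) of the two-layer ReLU network is \(\frac{1}{\sqrt{m}}\, b_r\, \mathbbm {1}\{[\theta ]_r^\top (s,a)>0\}\,(s,a)\). A minimal Python sketch (illustrative; the function names and toy inputs are our own) is:

```python
import math

def relu(z):
    return max(z, 0.0)

def f(x, theta, b):
    # Two-layer ReLU critic f(x; theta) with width m = len(theta) (Eq. (4)-style).
    m = len(theta)
    return sum(
        br * relu(sum(t * xi for t, xi in zip(tr, x))) for tr, br in zip(theta, b)
    ) / math.sqrt(m)

def semi_gradient(sa, sa_next, r, gamma, theta, b):
    # TD error: delta = f(s,a; theta) - (r + gamma * f(s',a'; theta)).
    delta = f(sa, theta, b) - (r + gamma * f(sa_next, theta, b))
    # Semi-gradient g = delta * grad_theta f(s,a; theta); per neuron,
    # d f / d theta_r = (1/sqrt(m)) * b_r * 1{theta_r . (s,a) > 0} * (s,a).
    m = len(theta)
    grad = []
    for tr, br in zip(theta, b):
        active = 1.0 if sum(t * xi for t, xi in zip(tr, sa)) > 0.0 else 0.0
        grad.append([delta * br * active * xi / math.sqrt(m) for xi in sa])
    return delta, grad
```

It is a *semi*-gradient because the bootstrapped target \(r + \gamma f((s',a');\theta _t)\) is treated as a constant when differentiating.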
Lemma 15
Under Assumption 4, for all \(\theta _t \in \mathcal {D}\), where \(\mathcal {D} = \{\xi \in \mathbb {R}^{md}:\Vert \xi -\varTheta _{\text {init}} \Vert _2 \le \varUpsilon \}\), it holds with probability \(1-\delta \) that,
Proof
By the definition of \(\bar{g}_t\) and \(\hat{g}_t\), we have
We obtain the inequality by the fact that \((A+B)^2 \le 2A^2+2B^2\). We first upper bound \(\mathbb {E}_{\nu _\pi }\big [(\delta _t^\theta -\delta _t^0)^2 \Vert \nabla _{\theta } f((s,a);\theta _t)\Vert _2^2\big ]\) in Eq. (53). Since \(\Vert (s,a)\Vert _2 \le 1\), we have \(\Vert \nabla _{\theta } f((s,a);\theta _t)\Vert _2 \le 1\). Then, by definition, we have the following first inequality,
We obtain the second inequality by \(|\gamma | \le 1\) and the third inequality by the fact that \((A+B)^2 \le 2A^2+2B^2\). We reach the final step by inserting Lemma 14. We then proceed to upper bound \(\mathbb {E}_{\nu _\pi }\big [|\delta _t^0| \Vert \nabla _{\theta } f((s,a);\theta _t)-\nabla _{\theta } \hat{f}((s,a);\theta _t)\Vert _2\big ]\). By Hölder's inequality, we have,
We first derive an upper bound for the first term in Eq. (55), starting from its definition,
We obtain the first and the third inequalities by the fact that \((A+B+C)^2 \le 3A^2+3B^2+3C^2\). Recall that \(r_{\max }\) is the bound on the reward function r, which leads to the second inequality. We obtain the last inequality in Eq. (56) from the fact that \(|\hat{f}((s,a);\theta _t)-\hat{f}((s,a);\theta _{\pi ^*})| \le \Vert \theta _t-\theta _{\pi ^*}\Vert \le 2\varUpsilon \) and \(Q_\pi \le (1-\gamma )^{-1}r_{\max }\). Since \(\bar{\mathcal {F}}_{\varUpsilon ,m} \subset \mathcal {F}_{\varUpsilon ,m}\), by Lemma 13, we have,
Combining Eqs. (56) and (57), we have, with probability \(1-\delta \),
Lastly, we have
We obtain the first inequality by following Eq. (50) and the facts that \(|b_v| \le 1\) and \(\Vert (s,a)\Vert _2 \le 1\). For the rest, we follow an argument similar to Eq. (52). To finish the proof, we plug Eqs. (54), (58), and (59) back into Eq. (53),
Then we have,
Next, we provide the following lemma to characterize the variance of \(g_t\).
Lemma 16
(Variance of the Stochastic Update Vector) [31]. There exists a constant \(\xi _g^2=\mathcal {O}(\varUpsilon ^2)\), independent of t, such that for any \(t \le T\), it holds that
A detailed proof can be found in [31]. Now we provide the proof for Lemma 2.
Proof
The inequality holds due to the definition of \(\varPi _\mathcal {D}\). We first upper bound \(\big \Vert g_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big \Vert _2^2\) in Eq. (60),
The inequality holds due to the fact that \((A+B+C)^2 \le 3A^2+3B^2+3C^2\). Two of the terms on the right-hand side of Eq. (61) are characterized in Lemmas 15 and 16. We therefore characterize the remaining term,
We obtain the first inequality by the fact that \(\Vert \nabla _{\theta }\hat{f}((s,a);\theta _t)\Vert _2 \le 1\). We then use the facts that (s, a) and \((s',a')\) have the same marginal distribution and that \(\gamma < 1\) for the second inequality. Following the Cauchy-Schwarz inequality and the fact that (s, a) and \((s',a')\) have the same marginal distribution, we have
We plug Eq. (63) back into Eq. (62),
Next, we upper bound \(\big ( g_t(\theta _t)-\hat{g}_t(\theta _{\pi ^*})\big )^\top \big (\theta _t-\theta _{\pi ^*}\big )\). We have,
One term on the right-hand side of Eq. (65) is characterized by Lemma 16. We continue to characterize the remaining terms. First, by Hölder's inequality, we have
We obtain the second inequality since \(\big \Vert \theta _t-\theta _{\pi ^*}\big \Vert _2 \le 2\varUpsilon \) by definition. For the last term,
where the inequality follows from Eq. (63). Combining Eqs. (60), (61), (64), (65), (66), and (67), we have,
We then bound the error terms by rearranging Eq. (68). First, we have, with probability \(1-\delta \),
where
We obtain the first inequality by the fact that \((A+B)^2\le 2A^2 + 2B^2\). Then, by Eq. (68), Lemma 14, and Lemma 15, we reach the final inequality. By telescoping Eq. (69) for \(t = 1\) to T, we have, with probability \(1-\delta \),
Setting \(\eta =\min \{1/\sqrt{T}, (1-\gamma )/3(1+\gamma )^2\}\), which implies that \(T^{-1/2}(2\eta (1-\gamma )-3\eta ^2(1+\gamma )^2)^{-1} \le 1/(1-\gamma )^2\), we have, with probability \(1-\delta \),
We obtain the second inequality by the fact that \(\Vert \varTheta _{\textrm{init}}-\theta _{\pi ^*} \Vert _2\le \varUpsilon \). Finally, by definition, we replace \(\tilde{Q}_{\omega _k}\) and \(\tilde{Q}_{\pi _k}\), which completes the proof.
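The projected update \(\theta _{t+1} = \varPi _\mathcal {D}(\theta _t - \eta g_t)\) used throughout this proof relies on the Euclidean projection onto the ball \(\mathcal {D} = \{\theta :\Vert \theta -\varTheta _{\textrm{init}}\Vert _2 \le \varUpsilon \}\), which has a simple closed form. The sketch below is a hypothetical helper for illustration, not code from the paper:

```python
import math

def project_to_ball(theta, theta_init, radius):
    # Pi_D: Euclidean projection of a (flattened) parameter vector onto
    # D = {theta : ||theta - Theta_init||_2 <= radius}.
    diff = [t - t0 for t, t0 in zip(theta, theta_init)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(theta)     # already inside the ball: unchanged
    scale = radius / norm      # otherwise, shrink onto the boundary
    return [t0 + scale * d for t0, d in zip(theta_init, diff)]
```

A single TD step then reads \(\theta _{t+1} = \varPi _\mathcal {D}(\theta _t - \eta g_t)\), which keeps every iterate inside \(\mathcal {D}\), as required by Lemmas 14 and 15.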
E Additional Related Work
E.1 Global Optimality of Policy Search Methods
A major challenge of existing RL research is the lack of theoretical justification, such as sample complexity analysis, mainly because the objective function of policy search in RL is often nonconvex. It is thus challenging to determine whether a policy search approach is guaranteed to reach the global optimum. Moreover, RL architecture components are usually parameterized by neural networks in practice, whose nonlinearity and complexity render the analysis significantly more difficult [62].
The theoretical understanding of policy gradient methods is also still developing. Work on this topic has mostly been done in tabular and linear parametrization settings for different variants of policy gradient. For example, [11] and [44] establish non-asymptotic convergence guarantees for natural policy gradient (NPG, [22]) and trust region policy optimization (TRPO, [42]), respectively. [35] show the convergence rate for the softmax parametrization, while [1] analyze multiple variants of policy gradient. On the other side of the spectrum, [31, 51] prove the global convergence and optimality of various policy gradient algorithms with over-parameterized neural networks. Furthermore, [62] apply the global optimality analysis to variance-constrained actor-critic risk-averse control with cumulative average rewards, and propose a corresponding variance-constrained actor-critic (VARAC) algorithm. However, the analysis procedure is complicated by the risk constraints on cumulative rewards, and the algorithm's experimental performance remains unverified. Therefore, it remains an open question whether a simplified global optimality analysis with verifiable experimental studies can be obtained for risk-averse policy search methods.
E.2 Over-Parameterized Neural Networks in RL
Overparameterization, a technique of deploying more parameters than necessary, improves the performance of neural networks [59]. The learning ability and generalization of over-parameterized neural networks have been studied extensively [2, 5, 15]. Integration with over-parameterized neural networks can be found in multiple RL topics. One line of work proves the global optimality of RL algorithms in a nonlinear approximation setting [31, 51, 62], using over-parameterized neural networks with ReLU activations together with policy gradient methods such as NPG and PPO. Our work also belongs to this category. Other works include [19], which also deploys a two-layer over-parameterized ReLU neural network, on the mean-field multi-agent reinforcement learning problem. Regularization with over-parameterized neural networks has also been investigated recently [25, 41].
© 2022 Springer Nature Switzerland AG
Xu, L., Lyu, D., Pan, Y., Jiang, A., Liu, B. (2022). TOPS: Transition-Based Volatility-Reduced Policy Search. In: Melo, F.S., Fang, F. (eds) Autonomous Agents and Multiagent Systems. Best and Visionary Papers. AAMAS 2022. Lecture Notes in Computer Science(), vol 13441. Springer, Cham. https://doi.org/10.1007/978-3-031-20179-0_1