
Student-t policy in reinforcement learning to acquire global optimum of robot control


Abstract

This paper proposes an actor-critic algorithm with a policy parameterized by the student-t distribution, named the student-t policy, to enhance learning performance, mainly in terms of the ability to reach the global optimum of the task to be learned. The actor-critic algorithm is one of the policy-gradient methods in reinforcement learning and is proved to converge to one of the local optima. To avoid local optima, an exploration ability to escape them and a conservative learning behavior that is not easily trapped in them are deemed to be empirically effective. The conventional policy parameterized by a normal distribution, however, fundamentally lacks these abilities, and the state-of-the-art methods can compensate for them only partially. Conversely, heavy-tailed distributions, including the student-t distribution, possess an excellent exploration ability known as Lévy flight, which models the efficient foraging behavior of animals. Another property of the heavy tail is robustness to outliers; namely, learning remains conservative and is not trapped in local optima even when extreme actions are taken. These desired properties of the student-t policy increase the possibility of the agent reaching the global optimum. Indeed, the student-t policy outperforms the conventional policy in four types of simulations, two of which are difficult to learn quickly without sufficient exploration, while the others contain local optima.


References

  1. Achiam J, Held D, Tamar A, Abbeel P (2017) Constrained policy optimization. In: International conference on machine learning, pp 22–31

  2. Aeschliman C, Park J, Kak AC (2010) A novel parameter estimation algorithm for the multivariate t-distribution and its application to computer vision. In: European conference on computer vision, pp 594–607. Springer

  3. Amari SI (1998) Natural gradient works efficiently in learning. Neural Comput 10(2):251–276


  4. Arellano-Valle RB (2010) On the information matrix of the multivariate skew-t model. Metron 68(3):371–386


  5. Barto AG, Sutton RS, Anderson CW (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans Syst Man Cybern 13(5):834–846


  6. Bartumeus F, da Luz ME, Viswanathan G, Catalan J (2005) Animal search strategies: A quantitative random-walk analysis. Ecology 86(11):3078–3087


  7. Bellemare M, Srinivasan S, Ostrovski G, Schaul T, Saxton D, Munos R (2016) Unifying count-based exploration and intrinsic motivation. In: Advances in neural information processing systems, pp 1471–1479

  8. Canal L (2005) A normal approximation for the chi-square distribution. Comput Stat Data Anal 48(4):803–808


  9. Chentanez N, Barto AG, Singh SP (2005) Intrinsically motivated reinforcement learning. In: Advances in neural information processing systems, pp 1281–1288

  10. Chou PW, Maturana D, Scherer S (2017) Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In: International conference on machine learning, pp 834–843

  11. Contreras-Reyes JE (2014) Asymptotic form of the Kullback–Leibler divergence for multivariate asymmetric heavy-tailed distributions. Physica A: Statistical Mechanics and its Applications 395:200–208


  12. Cui Y, Matsubara T, Sugimoto K (2017) Kernel dynamic policy programming: Applicable reinforcement learning to robot systems with high dimensional states. Neural Netw 94:13–23


  13. Daniel C, Neumann G, Kroemer O, Peters J (2016) Hierarchical relative entropy policy search. J Mach Learn Res 17(93):1–50


  14. Gu S, Lillicrap T, Turner RE, Ghahramani Z, Schölkopf B., Levine S (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In: Advances in neural information processing systems, pp 3849–3858

  15. Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv:1801.01290

  16. Heess N, Sriram S, Lemmon J, Merel J, Wayne G, Tassa Y, Erez T, Wang Z, Eslami A, Riedmiller M et al (2017) Emergence of locomotion behaviours in rich environments. arXiv:1707.02286

  17. Hirai K, Hirose M, Haikawa Y, Takenaka T (1998) The development of Honda humanoid robot. In: IEEE international conference on robotics and automation, vol 2, pp 1321–1326. IEEE

  18. Houthooft R, Chen X, Duan Y, Schulman J, De Turck F, Abbeel P (2016) VIME: Variational information maximizing exploration. In: Advances in neural information processing systems, pp 1109–1117

  19. Hwangbo J, Lee J, Dosovitskiy A, Bellicoso D, Tsounis V, Koltun V, Hutter M (2019) Learning agile and dynamic motor skills for legged robots. Sci Robot 4(26):eaau5872


  20. Kakade SM (2002) A natural policy gradient. In: Advances in neural information processing systems, pp 1531–1538

  21. Kingma D, Ba J (2015) Adam: A method for stochastic optimization. In: International conference for learning representations, pp 1–15

  22. Kobayashi T, Aoyama T, Sekiyama K, Fukuda T (2015) Selection algorithm for locomotion based on the evaluation of falling risk. IEEE Trans Robot 31(3):750–765


  23. Lange KL, Little RJ, Taylor JM (1989) Robust statistical modeling using the t distribution. J Am Stat Assoc 84(408):881–896


  24. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2015) Continuous control with deep reinforcement learning. arXiv:1509.02971

  25. Maaten LVD, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605


  26. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, pp 1928–1937

  27. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533


  28. Ng AY, Harada D, Russell S (1999) Policy invariance under reward transformations: Theory and application to reward shaping. In: International conference on machine learning, vol 99, pp 278–287

  29. Rohmer E, Singh SP, Freese M (2013) V-REP: A versatile and scalable robot simulation framework. In: IEEE/RSJ international conference on intelligent robots and systems, pp 1321–1326. IEEE

  30. Schulman J, Moritz P, Levine S, Jordan M, Abbeel P (2016) High-dimensional continuous control using generalized advantage estimation. In: International conference for learning representations, pp 1–14

  31. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv:1707.06347

  32. Shah A, Wilson A, Ghahramani Z (2014) Student-t processes as alternatives to Gaussian processes. In: Artificial intelligence and statistics, pp 877–885

  33. Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M (2014) Deterministic policy gradient algorithms. In: International conference on machine learning, pp 387–395

  34. Sutton RS, Barto AG (1998) Reinforcement learning: An introduction. MIT Press, Cambridge


  35. Svensén M, Bishop CM (2005) Robust Bayesian mixture modelling. Neurocomputing 64:235–252


  36. Thomas P (2014) Bias in natural actor-critic algorithms. In: International conference on machine learning, pp 441–448

  37. Tsurumine Y, Cui Y, Uchibe E, Matsubara T (2019) Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation. Robot Auton Syst 112:72–83


  38. Van Seijen H, Mahmood AR, Pilarski PM, Machado MC, Sutton RS (2016) True online temporal-difference learning. J Mach Learn Res 17(145):1–40


  39. Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8(3-4):229–256


  40. Zhao X, Ding S, An Y, Jia W (2019) Applications of asynchronous deep reinforcement learning based on dynamic updating weights. Appl Intell 49(2):581–591



Acknowledgments

This work was supported by JSPS KAKENHI, Grant-in-Aid for Young Scientists (B), Grant Number 17K12759.

Author information


Correspondence to Taisuke Kobayashi.


Appendices

A Details of the update rule

A.1 The gradient of PPO

The state-of-the-art method PPO [31] aims to maximize the following expected value by optimizing the policy parameter \(\boldsymbol{w}_{A}\).

$$ \begin{array}{@{}rcl@{}} &&\max_{\boldsymbol{w}_{A}} \mathbb{E}_{t} \left[ \hat r_{t}\hat A_{t} - \beta_{1} D_{KL}(\pi(\cdot \mid s_{t}, \boldsymbol{w}_{A,t-1}) \mid \pi(\cdot \mid s_{t}, \boldsymbol{w}_{A,t}))\right.\\ &&+ \left. \beta_{2} H(\pi(\cdot \mid s_{t}, \boldsymbol{w}_{A,t})) \right] \end{array} $$
(32)

where \(\hat r_{t} = \frac{\pi(a_{t} \mid s_{t}, \boldsymbol{w}_{A,t})}{\pi(a_{t} \mid s_{t}, \boldsymbol{w}_{A,t-1})}\) is the importance sampling ratio and \(\hat A_{t}\) is the advantage function approximated by GAE [30]. The gradient of this term corresponds to the term excluding PPO(⋅) from (7).

\(D_{KL}(\cdot)\) and \(H(\cdot)\) denote the KL divergence and the differential entropy, weighted by the coefficients \(\beta_{1}\) and \(\beta_{2}\), respectively. Therefore, in (7), PPO(⋅) corresponds to the gradients of \(D_{KL}(\cdot)\) and \(H(\cdot)\). While the KL penalty suppresses the policy update, the entropy bonus encourages exploration. In addition, \(\beta_{1}\) is adjusted by a simple heuristic in the original paper; this adjustment is modified here as follows:

$$ \beta_{1} \leftarrow \beta_{1} \exp\left( \frac{\bar D_{KL}^{\text{new}} - \bar D_{KL}^{\text{old}}}{\bar D_{KL}^{\text{new}} + \bar D_{KL}^{\text{old}}} \right) $$
(33)

where \(\bar D_{KL}^{\text{old}}\) and \(\bar D_{KL}^{\text{new}}\) are the moving averages of the current \(D_{KL}(\cdot)\) before and after the update, respectively.
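
As a concrete reference, a minimal Python sketch of this adaptation is given below; the function and variable names are illustrative, and the small eps term is an added numerical safeguard rather than part of (33).

```python
import math

def update_beta1(beta1, dkl_avg_new, dkl_avg_old, eps=1e-8):
    """Adapt the KL-penalty coefficient beta1 following (33).

    dkl_avg_new and dkl_avg_old are moving averages of the current D_KL
    taken after and before the policy update, respectively.
    """
    ratio = (dkl_avg_new - dkl_avg_old) / (dkl_avg_new + dkl_avg_old + eps)
    return beta1 * math.exp(ratio)
```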

Note that \(D_{KL}(\cdot)\) and \(H(\cdot)\), except \(D_{KL}(\cdot)\) for the student-t policy, can be derived analytically, as can their gradients. \(D_{KL}(\cdot)\) for the student-t policy, however, must be approximated by the closed-form expression of [11] so that its gradient can be computed analytically.

Given \(p_{1} \sim \mathcal {T}(\boldsymbol {\mu }_{1} , {\Sigma }_{1} , \nu _{1})\) and \(p_{2} \sim \mathcal {T}(\boldsymbol {\mu }_{2} , {\Sigma }_{2} , \nu _{2})\), the KL divergence between them is approximated as follows:

$$ \begin{array}{@{}rcl@{}} D_{KL}(p_{1} \mid p_{2}) &\simeq& \frac{1}{2}\ln\frac{|{\Sigma}_{2}|}{|{\Sigma}_{1}|} + \frac{1}{2}\frac{\nu_{2}+d}{\nu_{2}}\frac{\nu_{1}}{\nu_{1}-2}\text{tr}({\Sigma}_{2}^{-1}{\Sigma}_{1}) \\ &&+ \frac{1}{2}\frac{\nu_{2}+d}{\nu_{2}}(\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{2})^{\top}{\Sigma}_{2}^{-1}(\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{2})\\ &&- \frac{\nu_{1}+d}{2}\left\{ \psi\left( \frac{\nu_{1}+d}{2}\right) - \psi\left( \frac{\nu_{1}}{2}\right) \right\} \end{array} $$
(34)

Note that \(\nu_{1}\) should be larger than 2, although the student-t policy can have \(\nu\) smaller than 2. In this paper, its gradient is calculated by adding \(2 - \nu_{0}\) as an offset to \(\nu\).
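
For reference, the approximation (34) translates directly into NumPy. The following sketch assumes \(\nu_{1} > 2\) and leaves the offset on \(\nu\) described above to the caller; the function name and structure are illustrative, not taken from the paper's implementation.

```python
import numpy as np
from scipy.special import digamma

def approx_kl_student_t(mu1, Sigma1, nu1, mu2, Sigma2, nu2):
    """Closed-form approximation (34) of the KL divergence between two
    multivariate student-t distributions, following [11]; requires nu1 > 2."""
    d = mu1.shape[0]
    diff = mu1 - mu2
    Sigma2_inv = np.linalg.inv(Sigma2)
    # 0.5 * ln(|Sigma2| / |Sigma1|)
    log_det = 0.5 * (np.linalg.slogdet(Sigma2)[1] - np.linalg.slogdet(Sigma1)[1])
    # 0.5 * (nu2 + d)/nu2 * nu1/(nu1 - 2) * tr(Sigma2^-1 Sigma1)
    trace = 0.5 * (nu2 + d) / nu2 * nu1 / (nu1 - 2.0) * np.trace(Sigma2_inv @ Sigma1)
    # 0.5 * (nu2 + d)/nu2 * (mu1 - mu2)^T Sigma2^-1 (mu1 - mu2)
    mahalanobis = 0.5 * (nu2 + d) / nu2 * diff @ Sigma2_inv @ diff
    # (nu1 + d)/2 * (psi((nu1 + d)/2) - psi(nu1/2))
    psi_term = 0.5 * (nu1 + d) * (digamma(0.5 * (nu1 + d)) - digamma(0.5 * nu1))
    return log_det + trace + mahalanobis - psi_term
```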

A.2 True online GAE

The original TD(\(\lambda\)) [34] recursively updates the weights of the value function, \(\boldsymbol{w}_{C}\), by using the eligibility trace \(\boldsymbol{e}\).

$$ \begin{array}{@{}rcl@{}} \boldsymbol{e}_{C,t} &=& \gamma \lambda \boldsymbol{e}_{C,t-1} + \boldsymbol{x}(s_{t}) \end{array} $$
(35)
$$ \begin{array}{@{}rcl@{}} \boldsymbol{g}_{C,t} &=& \delta_{t} \boldsymbol{e}_{C,t} \end{array} $$
(36)
$$ \begin{array}{@{}rcl@{}} \boldsymbol{w}_{C,t+1} &=& \boldsymbol{w}_{C,t} + \alpha \boldsymbol{g}_{C,t} \end{array} $$
(37)

This TD(\(\lambda\)) involves an approximation, and therefore the true online TD(\(\lambda\)) [38] derives the exact recursive update rule (see (2)–(4)).
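
For reference, a single step of the original recursion (35)–(37) can be sketched in Python as follows, assuming a linear value function \(V(s) = \boldsymbol{w}_{C}^{\top}\boldsymbol{x}(s)\); the function name is illustrative.

```python
def td_lambda_step(w_C, e_C, x_t, delta_t, gamma, lam, alpha):
    """One step of the original TD(lambda) recursion (35)-(37)
    for a linear value function; all vector arguments are NumPy arrays."""
    e_C = gamma * lam * e_C + x_t   # (35): decay and accumulate the eligibility trace
    g_C = delta_t * e_C             # (36): scale the trace by the TD error
    w_C = w_C + alpha * g_C         # (37): update the critic weights
    return w_C, e_C
```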

GAE approximates the advantage function as \(\hat A\) by accumulating the TD errors \(\delta\).

$$ \hat A_{t} = \sum\limits_{k=0}^{\infty} (\gamma \lambda)^{k} \delta_{t+k} $$
(38)

By using this GAE [30] and REINFORCE [39], the offline policy gradient \(\boldsymbol{g}_{A}\) is given as follows:

$$ \begin{array}{@{}rcl@{}} \boldsymbol{g}_{A} &=& \sum\limits_{t=0}^{\infty} \hat A_{t} \nabla_{\boldsymbol{w}_{A}}\ln\pi(a_{t} \mid s_{t}, \boldsymbol{w}_{A}) \end{array} $$
(39)
$$ \begin{array}{@{}rcl@{}} &=& \sum\limits_{t=0}^{\infty} \boldsymbol{\hat x}(s_{t},a_{t}) \sum\limits_{k=0}^{\infty} (\gamma \lambda)^{k} \delta_{t+k} \end{array} $$
(40)
$$ \begin{array}{@{}rcl@{}} &=& \sum\limits_{t=0}^{\infty} \delta_{t} \sum\limits_{k=0}^{t} (\gamma \lambda)^{k} \boldsymbol{\hat x}(s_{t-k},a_{t-k}) \end{array} $$
(41)

where \(\boldsymbol{\hat x}\) is defined in (5). By defining \(\boldsymbol{e}_{A,t} = \sum_{k=0}^{t}(\gamma\lambda)^{k} \boldsymbol{\hat x}(s_{t-k},a_{t-k})\), this policy gradient can be approximated recursively by the following equations.

$$ \begin{array}{@{}rcl@{}} \boldsymbol{e}_{A,t} &=& \gamma \lambda \boldsymbol{e}_{A,t-1} + \boldsymbol{\hat x}(s_{t}, a_{t}) \end{array} $$
(42)
$$ \begin{array}{@{}rcl@{}} \boldsymbol{g}_{A,t} &=& \delta_{t} \boldsymbol{e}_{A,t} \end{array} $$
(43)
$$ \begin{array}{@{}rcl@{}} \boldsymbol{w}_{A,t+1} &=& \boldsymbol{w}_{A,t} + \alpha \boldsymbol{g}_{A,t} \end{array} $$
(44)

As can be seen from the above equations, the update rule of the actor based on the recursive GAE has the same form as the update rule of the critic based on the original TD(\(\lambda\)). Hence, the true online version of GAE can be derived as shown in (5)–(8).
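
The actor-side recursion (42)–(44) mirrors the critic-side one, with the trace accumulating \(\boldsymbol{\hat x}(s_{t}, a_{t})\) instead of \(\boldsymbol{x}(s_{t})\); a minimal sketch under the same illustrative assumptions:

```python
def recursive_gae_actor_step(w_A, e_A, x_hat_t, delta_t, gamma, lam, alpha):
    """One step of the recursive GAE-based actor update (42)-(44),
    where x_hat_t is the per-step feature defined in (5)."""
    e_A = gamma * lam * e_A + x_hat_t   # (42): trace over the per-step features
    g_A = delta_t * e_A                 # (43): recursive policy-gradient estimate
    w_A = w_A + alpha * g_A             # (44): update the actor parameters
    return w_A, e_A
```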

B Task specifications

All the simulation environments were created in V-REP [29] and uploaded to GitHub (https://github.com/kbys-t/gym_vrep) to reproduce the results. Here, we describe the design of their states, actions, rewards, and termination conditions.

B.1 (a) The rolling balance task

An agent, a half-sized humanoid robot NAO, tries to balance on a board placed on a cylinder. At the start of an episode, the agent stands at the center of the board, which is horizontal to the ground. The state space is defined by three states: the roll angle of the agent, the roll angle of the board, and the angular velocity of the roll angle of the board. The action space is defined by one action: the angular velocity of the roll angle of the agent. If the board comes into contact with the ground, the episode is terminated as a failure and the agent receives a large penalty of −100. At every step of the episode, the agent gets a reward that increases as the board and the ground become nearly parallel (the maximum is 1). In this task, local exploration is enough to reach the global optimum, although highly efficient exploration would allow the agent to learn to maintain balance faster.
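
To make the reward design concrete, the following is a hypothetical sketch of the per-step reward and termination logic described above; it is not the actual gym_vrep implementation, and the cosine shaping is an assumption.

```python
import numpy as np

def rolling_balance_step_reward(board_roll, board_touches_ground):
    """Hypothetical reward/termination logic for the rolling balance task;
    returns (reward, done)."""
    if board_touches_ground:
        return -100.0, True          # failure: large penalty, episode ends
    # assumed shaping: equals 1 when the board is parallel to the ground
    return float(np.cos(board_roll)), False
```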

B.2 (b) The task to navigate a ballbot

An agent, an omnidirectional mobile robot on a ball, tries to reach a goal at (1.0, 1.5) while avoiding a dining set of a table and chairs. The agent starts at rest at (−1.5, 1.5) at the beginning of an episode. The state space is defined by four states: the two-dimensional (2D) position of the agent and the 2D difference between the agent and a reference point. The action space is defined by two actions: the 2D velocities of the reference point. If the agent falls over, the episode is terminated as a failure with a penalty of −1. At every step of the episode, the agent gets a reward according to the distance between the agent and the goal (the maximum is 1). The agent cannot perceive the dining set, which blocks the shortest path to the goal; hence, it must find a detour to the goal through sufficient exploration.

B.3 (c) The task to open a door

An agent, a manipulator with seven degrees of freedom placed in front of a door, tries to open the door after turning its knob. The agent’s hand is on the knob at the start of an episode. The state space is defined by five states: the 3D position of the agent’s hand, the angle of the knob, and the angle of the door. The action space is defined by three actions: the 3D velocities of the agent’s hand. If the agent fully opens the door, the episode is terminated as a success without any bonus. At every step of the episode, the agent gets a small reward according to the angle of the knob (the maximum is 0.05) and a large reward according to the angle of the door (the maximum is 1). Since the agent receives these two rewards simultaneously, this task is prone to local optima; for example, the agent may turn the knob but never push the door, because pushing the door tends to move its hand away from the knob.

B.4 (d) The task to reach targets

An agent consisting of two planar manipulators, one of which is mounted on the first link of the other, tries to make both tips reach their respective targets, i.e., (−0.5, 0) and (0.5, 0), simultaneously. The state space is defined by four states: the angles of the respective joints. The action space is defined by four actions: the angular velocities of the respective joints. If both tips simultaneously reach within a radius of 0.1 m of their respective targets, the agent gets a large bonus of 100 and the episode is terminated as a success. At every other step of the episode, the agent gets no reward; namely, the reward is obtained only when the task succeeds. Such a sparse reward makes the task difficult to solve by reinforcement learning with poor exploration ability.


About this article


Cite this article

Kobayashi, T. Student-t policy in reinforcement learning to acquire global optimum of robot control. Appl Intell 49, 4335–4347 (2019). https://doi.org/10.1007/s10489-019-01510-8
