
Student-t policy in reinforcement learning to acquire global optimum of robot control


Abstract

This paper proposes an actor-critic algorithm with a policy parameterized by the student-t distribution, named the student-t policy, to enhance learning performance, mainly in terms of the ability to reach the global optimum of the task to be learned. The actor-critic algorithm is one of the policy-gradient methods in reinforcement learning and is proved to converge to one of the local optima. To avoid local optima, an exploration ability to escape them and a conservative learning behavior that is not easily trapped in them are deemed to be empirically effective. The conventional policy parameterized by a normal distribution, however, fundamentally lacks these abilities, and the state-of-the-art methods can compensate for them only partially. Conversely, heavy-tailed distributions, including the student-t distribution, possess an excellent exploration ability known as Lévy flight, which models the efficient foraging behavior of animals. Another property of the heavy tail is robustness to outliers; namely, learning remains conservative and is not trapped in local optima even when extreme actions are taken. These desired properties of the student-t policy increase the possibility of the agent reaching the global optimum. Indeed, the student-t policy outperforms the conventional policy in four types of simulations, two of which are difficult to learn quickly without sufficient exploration, while the others contain local optima.


References

  1. Achiam J, Held D, Tamar A, Abbeel P (2017) Constrained policy optimization. In: International conference on machine learning, pp 22–31

  2. Aeschliman C, Park J, Kak AC (2010) A novel parameter estimation algorithm for the multivariate t-distribution and its application to computer vision. In: European conference on computer vision, pp 594–607. Springer

  3. Amari SI (1998) Natural gradient works efficiently in learning. Neural Comput 10(2):251–276


  4. Arellano-Valle RB (2010) On the information matrix of the multivariate skew-t model. Metron 68(3):371–386


  5. Barto AG, Sutton RS, Anderson CW (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans Syst Man Cybern 13(5):834–846


  6. Bartumeus F, da Luz ME, Viswanathan G, Catalan J (2005) Animal search strategies: A quantitative random-walk analysis. Ecology 86(11):3078–3087


  7. Bellemare M, Srinivasan S, Ostrovski G, Schaul T, Saxton D, Munos R (2016) Unifying count-based exploration and intrinsic motivation. In: Advances in neural information processing systems, pp 1471–1479

  8. Canal L (2005) A normal approximation for the chi-square distribution. Comput Stat Data Anal 48(4):803–808


  9. Chentanez N, Barto AG, Singh SP (2005) Intrinsically motivated reinforcement learning. In: Advances in neural information processing systems, pp 1281–1288

  10. Chou PW, Maturana D, Scherer S (2017) Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In: International conference on machine learning, pp 834–843

  11. Contreras-Reyes JE (2014) Asymptotic form of the Kullback–Leibler divergence for multivariate asymmetric heavy-tailed distributions. Physica A: Statistical Mechanics and its Applications 395:200–208


  12. Cui Y, Matsubara T, Sugimoto K (2017) Kernel dynamic policy programming: Applicable reinforcement learning to robot systems with high dimensional states. Neural Netw 94:13–23


  13. Daniel C, Neumann G, Kroemer O, Peters J (2016) Hierarchical relative entropy policy search. J Mach Learn Res 17(93):1–50


  14. Gu S, Lillicrap T, Turner RE, Ghahramani Z, Schölkopf B., Levine S (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In: Advances in neural information processing systems, pp 3849–3858

  15. Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv:1801.01290

  16. Heess N, Sriram S, Lemmon J, Merel J, Wayne G, Tassa Y, Erez T, Wang Z, Eslami A, Riedmiller M et al (2017) Emergence of locomotion behaviours in rich environments. arXiv:1707.02286

  17. Hirai K, Hirose M, Haikawa Y, Takenaka T (1998) The development of Honda humanoid robot. In: IEEE international conference on robotics and automation, vol 2, pp 1321–1326. IEEE

  18. Houthooft R, Chen X, Duan Y, Schulman J, De Turck F, Abbeel P (2016) VIME: Variational information maximizing exploration. In: Advances in neural information processing systems, pp 1109–1117

  19. Hwangbo J, Lee J, Dosovitskiy A, Bellicoso D, Tsounis V, Koltun V, Hutter M (2019) Learning agile and dynamic motor skills for legged robots. Sci Robot 4(26):eaau5872


  20. Kakade SM (2002) A natural policy gradient. In: Advances in neural information processing systems, pp 1531–1538

  21. Kingma D, Ba J (2015) Adam: A method for stochastic optimization. In: International conference for learning representations, pp 1–15

  22. Kobayashi T, Aoyama T, Sekiyama K, Fukuda T (2015) Selection algorithm for locomotion based on the evaluation of falling risk. IEEE Trans Robot 31(3):750–765


  23. Lange KL, Little RJ, Taylor JM (1989) Robust statistical modeling using the t distribution. J Am Stat Assoc 84(408):881–896


  24. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2015) Continuous control with deep reinforcement learning. arXiv:1509.02971

  25. Maaten LVD, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605


  26. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, pp 1928–1937

  27. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533


  28. Ng AY, Harada D, Russell S (1999) Policy invariance under reward transformations: Theory and application to reward shaping. In: International conference on machine learning, vol 99, pp 278–287

  29. Rohmer E, Singh SP, Freese M (2013) V-REP: A versatile and scalable robot simulation framework. In: IEEE/RSJ international conference on intelligent robots and systems, pp 1321–1326. IEEE

  30. Schulman J, Moritz P, Levine S, Jordan M, Abbeel P (2016) High-dimensional continuous control using generalized advantage estimation. In: International conference for learning representations, pp 1–14

  31. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv:1707.06347

  32. Shah A, Wilson A, Ghahramani Z (2014) Student-t processes as alternatives to Gaussian processes. In: Artificial intelligence and statistics, pp 877–885

  33. Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M (2014) Deterministic policy gradient algorithms. In: International conference on machine learning, pp 387–395

  34. Sutton RS, Barto AG (1998) Reinforcement learning: An introduction. MIT Press, Cambridge


  35. Svensén M, Bishop CM (2005) Robust Bayesian mixture modelling. Neurocomputing 64:235–252


  36. Thomas P (2014) Bias in natural actor-critic algorithms. In: International conference on machine learning, pp 441–448

  37. Tsurumine Y, Cui Y, Uchibe E, Matsubara T (2019) Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation. Robot Auton Syst 112:72–83


  38. Van Seijen H, Mahmood AR, Pilarski PM, Machado MC, Sutton RS (2016) True online temporal-difference learning. J Mach Learn Res 17(145):1–40


  39. Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8(3-4):229–256


  40. Zhao X, Ding S, An Y, Jia W (2019) Applications of asynchronous deep reinforcement learning based on dynamic updating weights. Appl Intell 49(2):581–591



Acknowledgments

This work was supported by JSPS KAKENHI, Grant-in-Aid for Young Scientists (B), Grant Number 17K12759.

Author information


Correspondence to Taisuke Kobayashi.


Appendices

A Details of the update rule

A.1 The gradient of PPO

The state-of-the-art method PPO [31] aims to maximize the following expected value by optimizing the policy parameter \(\boldsymbol{w}_{A}\).

$$ \begin{array}{@{}rcl@{}} &&\max_{\boldsymbol{w}_{A}} \mathbb{E}_{t} \left[ \hat r_{t}\hat A_{t} - \beta_{1} D_{KL}(\pi(\cdot \mid s_{t}, \boldsymbol{w}_{A,t-1}) \mid \pi(\cdot \mid s_{t}, \boldsymbol{w}_{A,t}))\right.\\ &&+ \left. \beta_{2} H(\pi(\cdot \mid s_{t}, \boldsymbol{w}_{A,t})) \right] \end{array} $$
(32)

where \(\hat r_{t} = \frac{\pi(a_{t} \mid s_{t}, \boldsymbol{w}_{A,t})}{\pi(a_{t} \mid s_{t}, \boldsymbol{w}_{A,t-1})}\) is the importance sampling ratio and \(\hat A_{t}\) is the advantage function approximated by GAE [30]. The gradient of this term corresponds to the term excluding PPO(⋅) from (7).

\(D_{KL}(\cdot)\) and \(H(\cdot)\) denote the KL divergence and the differential entropy, weighted by the coefficients \(\beta_{1}\) and \(\beta_{2}\), respectively. Therefore, in (7), PPO(⋅) corresponds to the gradients of \(D_{KL}(\cdot)\) and \(H(\cdot)\). While the KL penalty suppresses the policy update, the entropy bonus encourages exploration. In addition, \(\beta_{1}\) is adjusted by a simple heuristic in the original paper; this adjustment is modified here as follows:

$$ \beta_{1} \leftarrow \beta_{1} \exp\left( \frac{\bar D_{KL}^{\text{new}} - \bar D_{KL}^{\text{old}}}{\bar D_{KL}^{\text{new}} + \bar D_{KL}^{\text{old}}} \right) $$
(33)

where \(\bar D_{KL}^{\text{old}}\) and \(\bar D_{KL}^{\text{new}}\) are the moving averages of the current \(D_{KL}(\cdot)\) before and after the update, respectively.
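
As a concrete reference, a minimal Python sketch of this adaptation is given below; the function and variable names are illustrative, and the small eps term is an added numerical safeguard rather than part of (33).

```python
import math

def update_beta1(beta1, dkl_avg_new, dkl_avg_old, eps=1e-8):
    """Adapt the KL-penalty coefficient beta1 following (33).

    dkl_avg_new and dkl_avg_old are moving averages of the current D_KL
    taken after and before the policy update, respectively.
    """
    ratio = (dkl_avg_new - dkl_avg_old) / (dkl_avg_new + dkl_avg_old + eps)
    return beta1 * math.exp(ratio)
```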

Note that \(D_{KL}(\cdot)\) and \(H(\cdot)\), except \(D_{KL}(\cdot)\) for the student-t policy, can be derived analytically, as can their gradients. \(D_{KL}(\cdot)\) for the student-t policy, however, must be approximated by the closed-form expression of [11] so that its gradient can be computed analytically.

Given \(p_{1} \sim \mathcal {T}(\boldsymbol {\mu }_{1} , {\Sigma }_{1} , \nu _{1})\) and \(p_{2} \sim \mathcal {T}(\boldsymbol {\mu }_{2} , {\Sigma }_{2} , \nu _{2})\), the KL divergence between them is approximated as follows:

$$ \begin{array}{@{}rcl@{}} D_{KL}(p_{1} \mid p_{2}) &\simeq& \frac{1}{2}\ln\frac{|{\Sigma}_{2}|}{|{\Sigma}_{1}|} + \frac{1}{2}\frac{\nu_{2}+d}{\nu_{2}}\frac{\nu_{1}}{\nu_{1}-2}\text{tr}({\Sigma}_{2}^{-1}{\Sigma}_{1}) \\ &&+ \frac{1}{2}\frac{\nu_{2}+d}{\nu_{2}}(\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{2})^{\top}{\Sigma}_{2}^{-1}(\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{2})\\ &&- \frac{\nu_{1}+d}{2}\left\{ \psi\left( \frac{\nu_{1}+d}{2}\right) - \psi\left( \frac{\nu_{1}}{2}\right) \right\} \end{array} $$
(34)

Note that \(\nu_{1}\) should be larger than 2, although the student-t policy can have \(\nu\) smaller than 2. In this paper, its gradient is calculated by adding \(2 - \nu_{0}\) as an offset to \(\nu\).
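
For reference, the approximation (34) translates directly into NumPy. The following sketch assumes \(\nu_{1} > 2\) and leaves the offset on \(\nu\) described above to the caller; the function name and structure are illustrative, not taken from the paper's implementation.

```python
import numpy as np
from scipy.special import digamma

def approx_kl_student_t(mu1, Sigma1, nu1, mu2, Sigma2, nu2):
    """Closed-form approximation (34) of the KL divergence between two
    multivariate student-t distributions, following [11]; requires nu1 > 2."""
    d = mu1.shape[0]
    diff = mu1 - mu2
    Sigma2_inv = np.linalg.inv(Sigma2)
    # 0.5 * ln(|Sigma2| / |Sigma1|)
    log_det = 0.5 * (np.linalg.slogdet(Sigma2)[1] - np.linalg.slogdet(Sigma1)[1])
    # 0.5 * (nu2 + d)/nu2 * nu1/(nu1 - 2) * tr(Sigma2^-1 Sigma1)
    trace = 0.5 * (nu2 + d) / nu2 * nu1 / (nu1 - 2.0) * np.trace(Sigma2_inv @ Sigma1)
    # 0.5 * (nu2 + d)/nu2 * (mu1 - mu2)^T Sigma2^-1 (mu1 - mu2)
    mahalanobis = 0.5 * (nu2 + d) / nu2 * diff @ Sigma2_inv @ diff
    # (nu1 + d)/2 * (psi((nu1 + d)/2) - psi(nu1/2))
    psi_term = 0.5 * (nu1 + d) * (digamma(0.5 * (nu1 + d)) - digamma(0.5 * nu1))
    return log_det + trace + mahalanobis - psi_term
```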

A.2 True online GAE

The original TD(\(\lambda\)) [34] recursively updates the weights of the value function, \(\boldsymbol{w}_{C}\), by using the eligibility trace \(\boldsymbol{e}\).

$$ \begin{array}{@{}rcl@{}} \boldsymbol{e}_{C,t} &=& \gamma \lambda \boldsymbol{e}_{C,t-1} + \boldsymbol{x}(s_{t}) \end{array} $$
(35)
$$ \begin{array}{@{}rcl@{}} \boldsymbol{g}_{C,t} &=& \delta_{t} \boldsymbol{e}_{C,t} \end{array} $$
(36)
$$ \begin{array}{@{}rcl@{}} \boldsymbol{w}_{C,t+1} &=& \boldsymbol{w}_{C,t} + \alpha \boldsymbol{g}_{C,t} \end{array} $$
(37)

This TD(\(\lambda\)) involves an approximation, and therefore the true online TD(\(\lambda\)) [38] derives the exact recursive update rule (see (2)–(4)).
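
For reference, a single step of the original recursion (35)–(37) can be sketched in Python as follows, assuming a linear value function \(V(s) = \boldsymbol{w}_{C}^{\top}\boldsymbol{x}(s)\); the function name is illustrative.

```python
def td_lambda_step(w_C, e_C, x_t, delta_t, gamma, lam, alpha):
    """One step of the original TD(lambda) recursion (35)-(37)
    for a linear value function; all vector arguments are NumPy arrays."""
    e_C = gamma * lam * e_C + x_t   # (35): decay and accumulate the eligibility trace
    g_C = delta_t * e_C             # (36): scale the trace by the TD error
    w_C = w_C + alpha * g_C         # (37): update the critic weights
    return w_C, e_C
```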

GAE approximates the advantage function as \(\hat A\) by accumulating the TD errors \(\delta\).

$$ \hat A_{t} = \sum\limits_{k=0}^{\infty} (\gamma \lambda)^{k} \delta_{t+k} $$
(38)

By using this GAE [30] and REINFORCE [39], the offline policy gradient \(\boldsymbol{g}_{A}\) is given as follows:

$$ \begin{array}{@{}rcl@{}} \boldsymbol{g}_{A} &=& \sum\limits_{t=0}^{\infty} \hat A_{t} \nabla_{\boldsymbol{w}_{A}}\ln\pi(a_{t} \mid s_{t}, \boldsymbol{w}_{A}) \end{array} $$
(39)
$$ \begin{array}{@{}rcl@{}} &=& \sum\limits_{t=0}^{\infty} \boldsymbol{\hat x}(s_{t},a_{t}) \sum\limits_{k=0}^{\infty} (\gamma \lambda)^{k} \delta_{t+k} \end{array} $$
(40)
$$ \begin{array}{@{}rcl@{}} &=& \sum\limits_{t=0}^{\infty} \delta_{t} \sum\limits_{k=0}^{t} (\gamma \lambda)^{k} \boldsymbol{\hat x}(s_{t-k},a_{t-k}) \end{array} $$
(41)

where \(\boldsymbol{\hat x}\) is defined in (5). By defining \(\boldsymbol{e}_{A,t} = \sum_{k=0}^{t}(\gamma\lambda)^{k} \boldsymbol{\hat x}(s_{t-k},a_{t-k})\), this policy gradient can be approximated recursively by the following equations.

$$ \begin{array}{@{}rcl@{}} \boldsymbol{e}_{A,t} &=& \gamma \lambda \boldsymbol{e}_{A,t-1} + \boldsymbol{\hat x}(s_{t}, a_{t}) \end{array} $$
(42)
$$ \begin{array}{@{}rcl@{}} \boldsymbol{g}_{A,t} &=& \delta_{t} \boldsymbol{e}_{A,t} \end{array} $$
(43)
$$ \begin{array}{@{}rcl@{}} \boldsymbol{w}_{A,t+1} &=& \boldsymbol{w}_{A,t} + \alpha \boldsymbol{g}_{A,t} \end{array} $$
(44)

As can be seen from the above equations, the update rule of the actor based on the recursive GAE has the same form as the update rule of the critic based on the original TD(\(\lambda\)). Hence, the true online version of GAE can be derived as shown in (5)–(8).
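
The actor-side recursion (42)–(44) mirrors the critic-side one, with the trace accumulating \(\boldsymbol{\hat x}(s_{t}, a_{t})\) instead of \(\boldsymbol{x}(s_{t})\); a minimal sketch under the same illustrative assumptions:

```python
def recursive_gae_actor_step(w_A, e_A, x_hat_t, delta_t, gamma, lam, alpha):
    """One step of the recursive GAE-based actor update (42)-(44),
    where x_hat_t is the per-step feature defined in (5)."""
    e_A = gamma * lam * e_A + x_hat_t   # (42): trace over the per-step features
    g_A = delta_t * e_A                 # (43): recursive policy-gradient estimate
    w_A = w_A + alpha * g_A             # (44): update the actor parameters
    return w_A, e_A
```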

B Task specifications

All the simulation environments were created in V-REP [29] and uploaded to GitHub (https://github.com/kbys-t/gym_vrep) to reproduce the results. Here, we describe the design of their states, actions, rewards, and termination conditions.

B.1 (a) The rolling balance task

An agent, a half-sized humanoid robot NAO, tries to balance on a board placed on a cylinder. At the start of an episode, the agent stands at the center of the board, which is horizontal to the ground. The state space is defined by three states: the roll angle of the agent, the roll angle of the board, and the angular velocity of the roll angle of the board. The action space is defined by one action: the angular velocity of the roll angle of the agent. If the board comes into contact with the ground, the episode is terminated as a failure and the agent receives a large penalty of −100. At every step of the episode, the agent gets a reward that increases as the board and the ground become nearly parallel (the maximum is 1). In this task, local exploration is enough to reach the global optimum, although highly efficient exploration would allow the agent to learn to maintain balance faster.
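
To make the reward design concrete, the following is a hypothetical sketch of the per-step reward and termination logic described above; it is not the actual gym_vrep implementation, and the cosine shaping is an assumption.

```python
import numpy as np

def rolling_balance_step_reward(board_roll, board_touches_ground):
    """Hypothetical reward/termination logic for the rolling balance task;
    returns (reward, done)."""
    if board_touches_ground:
        return -100.0, True          # failure: large penalty, episode ends
    # assumed shaping: equals 1 when the board is parallel to the ground
    return float(np.cos(board_roll)), False
```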

B.2 (b) The task to navigate a ballbot

An agent, an omnidirectional mobile robot on a ball, tries to reach a goal at (1.0, 1.5) while avoiding a dining set of a table and chairs. The agent starts at rest at (−1.5, 1.5) at the beginning of an episode. The state space is defined by four states: the two-dimensional (2D) position of the agent and the 2D difference between the agent and a reference point. The action space is defined by two actions: the 2D velocities of the reference point. If the agent falls over, the episode is terminated as a failure with a penalty of −1. At every step of the episode, the agent gets a reward according to the distance between the agent and the goal (the maximum is 1). The agent cannot perceive the dining set, which blocks the shortest path to the goal; hence, it must find a detour to the goal through sufficient exploration.

B.3 (c) The task to open a door

An agent, a manipulator with seven degrees of freedom placed in front of a door, tries to open the door after turning its knob. The agent’s hand is on the knob at the start of an episode. The state space is defined by five states: the 3D position of the agent’s hand, the angle of the knob, and the angle of the door. The action space is defined by three actions: the 3D velocities of the agent’s hand. If the agent fully opens the door, the episode is terminated as a success without any bonus. At every step of the episode, the agent gets a small reward according to the angle of the knob (the maximum is 0.05) and a large reward according to the angle of the door (the maximum is 1). Since the agent receives these two rewards simultaneously, this task is prone to local optima; for example, the agent may turn the knob but never push the door, because pushing the door tends to move its hand away from the knob.

B.4 (d) The task to reach targets

An agent consisting of two planar manipulators, one of which is mounted on the first link of the other, tries to make both tips reach their respective targets, i.e., (−0.5, 0) and (0.5, 0), simultaneously. The state space is defined by four states: the angles of the respective joints. The action space is defined by four actions: the angular velocities of the respective joints. If both tips simultaneously reach within a radius of 0.1 m of their respective targets, the agent gets a large bonus of 100 and the episode is terminated as a success. At every other step of the episode, the agent gets no reward; namely, the reward is obtained only when the task succeeds. Such a sparse reward makes the task difficult to solve by reinforcement learning with poor exploration ability.


About this article


Cite this article

Kobayashi, T. Student-t policy in reinforcement learning to acquire global optimum of robot control. Appl Intell 49, 4335–4347 (2019). https://doi.org/10.1007/s10489-019-01510-8
