
Gradient Estimation in Model-Based Reinforcement Learning: A Study on Linear Quadratic Environments

  • Conference paper
  • Published in: Intelligent Systems (BRACIS 2021)

Abstract

Stochastic Value Gradient (SVG) methods underlie many recent achievements of model-based Reinforcement Learning agents in continuous state-action spaces. Despite their practical significance, many algorithm design choices still lack rigorous theoretical or empirical justification. In this work, we analyze one such design choice: the gradient estimator formula. We conduct our analysis on randomized Linear Quadratic Gaussian environments, allowing us to empirically assess gradient estimation quality relative to the actual SVG. Our results justify a widely used gradient estimator by showing it induces a favorable bias-variance tradeoff, which could explain the lower sample complexity of recent SVG methods.
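
To make the setting concrete, below is a minimal sketch of a pathwise (reparameterized) value-gradient estimate for a linear-Gaussian model with quadratic cost, written in PyTorch. It only illustrates the general SVG idea of backpropagating through a differentiable model rollout; it does not reproduce the specific estimator formulas compared in the paper, and all dimensions and constants are illustrative assumptions.

```python
import torch

n, m, horizon, batch = 2, 2, 8, 200  # illustrative sizes, not the paper's exact setup

# Linear-Gaussian model: x' = A x + B u + w, with w ~ N(0, sigma^2 I).
A = 0.3 * torch.randn(n, n)
B = 0.3 * torch.randn(n, m)
Q = torch.eye(n)  # quadratic state cost
R = torch.eye(m)  # quadratic action cost
sigma = 0.1

# Deterministic linear policy u = -K x with learnable gain K.
K = torch.nn.Parameter(torch.zeros(m, n))

x = torch.randn(batch, n)  # batch of sampled initial states
cost = torch.zeros(batch)
for _ in range(horizon):
    u = -x @ K.T                           # policy action
    cost = cost + (x @ Q * x).sum(-1) + (u @ R * u).sum(-1)
    noise = sigma * torch.randn(batch, n)  # reparameterized transition noise
    x = x @ A.T + u @ B.T + noise          # differentiable model rollout

# Pathwise Monte Carlo estimate of the value gradient w.r.t. the policy
# parameters, obtained by backpropagating through the rollout.
cost.mean().backward()
svg_estimate = K.grad
```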


Notes

  1. Our formula differs slightly from the original in that it considers a deterministic policy instead of a stochastic one.

  2. We use the make_spd_matrix function from scikit-learn (see the environment-construction sketch following these notes).

  3. We use the scipy.signal.place_poles function (see the environment-construction sketch following these notes).

  4. We use the same 10 random seeds for experiments across values of K.

  5. We use seaborn.lineplot to produce the aggregated curves.

  6. Learning rate of \(10^{-2}\), \(B=200\), and \(K=8\).

  7. Recall from Sect. 3 that LQG allows us to compute the optimal policy analytically (also illustrated in the sketch following these notes).

  8. We found that the computation times for both estimators were equivalent.

  9. We only clip the gradient norm, at a maximum of 100, to avoid numerical errors (see the update-step sketch following these notes).
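
As a concrete illustration of notes 2, 3, and 7, the sketch below combines the cited utilities to build a small randomized LQG instance: random linear dynamics, random symmetric positive-definite cost matrices via make_spd_matrix, a stabilizing initial policy via scipy.signal.place_poles, and the analytically optimal policy via the discrete algebraic Riccati equation. The dimensions, pole locations, and the use of scipy.linalg.solve_discrete_are are assumptions for illustration; the paper's exact construction is not reproduced here.

```python
import numpy as np
from numpy.random import default_rng
from scipy.linalg import solve_discrete_are
from scipy.signal import place_poles
from sklearn.datasets import make_spd_matrix

rng = default_rng(0)
n, m = 2, 2  # assumed state and action dimensions

# Random linear dynamics x' = A x + B u + w.
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, m))

# Random symmetric positive-definite cost matrices for x'Qx + u'Ru (note 2).
Q = make_spd_matrix(n, random_state=0)
R = make_spd_matrix(m, random_state=1)

# Stabilizing initial policy u = -K0 x via pole placement (note 3):
# choose closed-loop poles of A - B K0 strictly inside the unit circle.
K0 = place_poles(A, B, np.array([0.5, -0.3])).gain_matrix

# Optimal policy u = -K_star x in closed form via the discrete-time
# algebraic Riccati equation (note 7).
P = solve_discrete_are(A, B, Q, R)
K_star = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```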
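Notes 6 and 9 describe the update step: a gradient step with learning rate \(10^{-2}\) and gradient-norm clipping at 100. A minimal sketch follows; the optimizer choice (SGD) and the placeholder loss are assumptions, while only the learning rate and clipping threshold come from the notes.

```python
import torch

# Learnable policy gain (dimensions are illustrative).
K = torch.nn.Parameter(torch.zeros(2, 2))
optimizer = torch.optim.SGD([K], lr=1e-2)  # learning rate from note 6; SGD is an assumption

# Placeholder surrogate loss standing in for the negated SVG objective.
loss = (K ** 2).sum()

optimizer.zero_grad()
loss.backward()
# Clip only the gradient norm, at a maximum of 100 (note 9), so the update
# direction is unchanged unless the estimate is extremely large.
torch.nn.utils.clip_grad_norm_([K], max_norm=100.0)
optimizer.step()
```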


Acknowledgments

This work was partly supported by CAPES grant 88887.339578/2019-00 (first author), FAPESP grant 2016/22900-1 (second author), and CNPq scholarship 307979/2018-0 (third author).

Author information

Correspondence to Ângelo Gregório Lovatto.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Lovatto, Â.G., Bueno, T.P., de Barros, L.N. (2021). Gradient Estimation in Model-Based Reinforcement Learning: A Study on Linear Quadratic Environments. In: Britto, A., Valdivia Delgado, K. (eds.) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol. 13073. Springer, Cham. https://doi.org/10.1007/978-3-030-91702-9_3

  • DOI: https://doi.org/10.1007/978-3-030-91702-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91701-2

  • Online ISBN: 978-3-030-91702-9

  • eBook Packages: Computer Science, Computer Science (R0)
