Abstract
Model-based Reinforcement Learning (MBRL) agents use data collected through exploration of the environment to learn a model of the dynamics, which is then used to select a policy that maximizes the objective function. Stochastic Value Gradient (SVG) methods perform the latter step by optimizing some estimate of the gradient of the value function. Despite showing promising empirical results, many implementations of SVG methods lack rigorous theoretical or empirical justification; this casts doubt on whether their good performance is in large part due to benchmark overfitting. To better understand the advantages and shortcomings of existing SVG methods, in this work we carry out a fine-grained empirical analysis of three core components of SVG-based agents: (i) the gradient estimator formula, (ii) model learning, and (iii) value function approximation. To this end, we extend previous work that proposes using Linear Quadratic Gaussian (LQG) regulator problems to benchmark SVG methods. LQG problems are heavily studied in the optimal control literature and deliver challenging learning settings while still allowing comparison with ground-truth values. We use such problems to investigate the contribution of each core component of SVG methods to overall performance. We focus our analysis on the model learning component, which was neglected in previous work, and show that overfitting to on-policy data can lead to accurate state predictions but inaccurate gradients, highlighting the importance of exploration in model-based methods as well.
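To make the setting concrete, below is a minimal sketch (not the authors' implementation) of an SVG-style gradient estimate in an LQG problem: it differentiates the return of a short simulated rollout with respect to the parameters of a deterministic linear policy, i.e., the pathwise (reparameterization) estimator that SVG methods build on. All matrices, dimensions, and the horizon are illustrative assumptions.

```python
# Hedged sketch of a pathwise (SVG-style) value gradient in an LQG problem:
# linear-Gaussian dynamics, quadratic cost, deterministic linear policy a = K s.
# A, B, Q, R, sigma, and the horizon H are illustrative choices, not the
# paper's benchmark parameters.
import torch

torch.manual_seed(0)
n, m, H = 2, 1, 20                          # state dim, action dim, horizon
A = torch.tensor([[1.0, 0.1], [0.0, 1.0]])  # dynamics matrix
B = torch.tensor([[0.0], [0.1]])            # input matrix
Q = torch.eye(n)                            # state cost
R = 0.1 * torch.eye(m)                      # action cost
sigma = 0.01                                # dynamics noise scale

K = torch.zeros(m, n, requires_grad=True)   # linear policy parameters

def rollout_cost(K, s0, noises):
    """H-step quadratic cost of the policy a = K s under fixed noise draws."""
    s, cost = s0, torch.tensor(0.0)
    for w in noises:
        a = K @ s
        cost = cost + s @ Q @ s + a @ R @ a
        s = A @ s + B @ a + sigma * w       # reparameterized transition
    return cost

s0 = torch.tensor([1.0, 0.0])
noises = torch.randn(H, n)                  # fixed noise sample (reparameterization)
cost = rollout_cost(K, s0, noises)
cost.backward()                             # backprop through the rollout
print(K.grad)                               # pathwise estimate of d(cost)/dK
```

In an LQG problem the exact value gradient is available in closed form, so estimates like the one above can be checked against ground truth, which is what makes this family of problems attractive as a benchmark.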
Notes
1. Unlike the RL setting, in the optimal control literature it is often assumed that one has access to the true environment dynamics and reward models.
2. Our formula differs slightly from the original in that it considers a deterministic policy instead of a stochastic one (see the sketch below).
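For reference, a hedged sketch of the distinction drawn in Note 2, written in the standard SVG/deterministic-policy-gradient form (omitting state-distribution terms); this is not necessarily the exact formula used in the paper.

```latex
% The original SVG formulation reparameterizes a stochastic policy,
% a = \pi_\theta(s, \eta) with policy noise \eta:
\nabla_\theta V(s)
  = \mathbb{E}_{\eta}\!\left[
      \nabla_\theta \pi_\theta(s,\eta)\,
      \nabla_a Q^{\pi}(s,a)\big|_{a=\pi_\theta(s,\eta)}
    \right].
% With a deterministic policy a = \pi_\theta(s), the expectation over the
% policy noise drops out (cf. deterministic policy gradients):
\nabla_\theta V(s)
  = \nabla_\theta \pi_\theta(s)\,
    \nabla_a Q^{\pi}(s,a)\big|_{a=\pi_\theta(s)}.
```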
Acknowledgments
This work was partly supported by the CAPES grant #88887.339578/2019-00 (first author), by the joint FAPESP-IBM grant #2019/07665-4 (second and third authors), and by the CNPq PQ grant #304012/2019-0 (third author).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lovatto, Â.G., de Barros, L.N., Mauá, D.D. (2022). Exploration Versus Exploitation in Model-Based Reinforcement Learning: An Empirical Study. In: Xavier-Junior, J.C., Rios, R.A. (eds.) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science, vol. 13654. Springer, Cham. https://doi.org/10.1007/978-3-031-21689-3_3
DOI: https://doi.org/10.1007/978-3-031-21689-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21688-6
Online ISBN: 978-3-031-21689-3
eBook Packages: Computer Science (R0)