
Exploration Versus Exploitation in Model-Based Reinforcement Learning: An Empirical Study

  • Conference paper
Intelligent Systems (BRACIS 2022)

Abstract

Model-based Reinforcement Learning (MBRL) agents use data collected through exploration of the environment to learn a model of the dynamics, which is then used to select a policy that maximizes the objective function. Stochastic Value Gradient (SVG) methods perform the latter step by optimizing an estimate of the gradient of the value function. Despite promising empirical results, many implementations of SVG methods lack rigorous theoretical or empirical justification; this casts doubt on whether their good performance is largely due to benchmark overfitting. To better understand the advantages and shortcomings of existing SVG methods, in this work we carry out a fine-grained empirical analysis of three core components of SVG-based agents: (i) the gradient estimator formula, (ii) model learning, and (iii) value function approximation. To this end, we extend previous work that proposes using Linear Quadratic Gaussian (LQG) regulator problems to benchmark SVG methods. LQG problems are heavily studied in the optimal control literature and provide challenging learning settings while still allowing comparison with ground-truth values. We use such problems to investigate the contribution of each core component to overall performance. We focus our analysis on the model learning component, which was neglected in previous work, and show that overfitting to on-policy data can lead to accurate state predictions but inaccurate gradients, highlighting the importance of exploration in model-based methods as well.
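To make the three components concrete, here is a minimal, self-contained sketch (not the authors' code) of an SVG-style pathwise gradient on a toy LQG-like problem, comparing the gradient obtained through the true dynamics against one obtained through a perturbed model. All dimensions, matrices, the horizon, and the perturbation standing in for a learned model are illustrative assumptions.

import torch

torch.manual_seed(0)
n, m, horizon, gamma = 2, 1, 20, 0.99
A = torch.tensor([[1.0, 0.1], [0.0, 1.0]])   # true dynamics: s' = A s + B a
B = torch.tensor([[0.0], [0.1]])
Q, R = torch.eye(n), 0.1 * torch.eye(m)       # quadratic cost: s'Q s + a'R a
K = torch.zeros(m, n, requires_grad=True)     # deterministic linear policy: a = K s

def rollout(A_hat, B_hat):
    """Discounted return of a model rollout from a fixed start state.

    The return is differentiable w.r.t. the policy K (the pathwise / SVG trick)."""
    s, ret = torch.tensor([1.0, 0.0]), torch.zeros(())
    for t in range(horizon):
        a = K @ s                              # (iii) the value tail is simply truncated here;
        cost = s @ Q @ s + a @ R @ a           #       a learned critic could bootstrap instead
        s = A_hat @ s + B_hat @ a              # noiseless rollout keeps the comparison exact
        ret = ret - (gamma ** t) * cost
    return ret

# (i) SVG-style gradient: backpropagate the return through the rollout.
# (ii) model learning is mocked by perturbing the true dynamics; in the paper the model
#      is fit to exploration data, and its errors are what distort the policy gradient.
A_hat = A + 0.05 * torch.randn_like(A)
grad_true = torch.autograd.grad(rollout(A, B), K)[0]
grad_model = torch.autograd.grad(rollout(A_hat, B), K)[0]
cos = torch.nn.functional.cosine_similarity(grad_true.flatten(), grad_model.flatten(), dim=0)
print("cosine similarity between true and model-based gradients:", float(cos))

Comparing the two gradients (e.g., via cosine similarity) is the kind of fine-grained check that LQG settings enable, since ground-truth quantities are available for comparison.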


Notes

  1. Unlike the RL setting, in the Optimal Control literature it is often assumed that one has access to the true environment dynamics and reward models.

  2. Our formula differs slightly from the original in that it considers a deterministic policy instead of a stochastic one (see the illustrative expressions below).
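For illustration only, and not necessarily the exact formula used in the paper: with a reparameterized stochastic policy $a = \pi_\theta(s) + \sigma_\theta \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, a one-step stochastic value gradient averages over the policy noise, while the deterministic variant simply drops it:

\nabla_\theta V(s) \;\approx\; \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \Big[ \nabla_\theta \big( r(s, a) + \gamma \hat{V}(\hat{f}(s, a)) \big) \Big|_{a = \pi_\theta(s) + \sigma_\theta \epsilon} \Big]
\quad\text{versus}\quad
\nabla_\theta \Big( r(s, \pi_\theta(s)) + \gamma \hat{V}\big(\hat{f}(s, \pi_\theta(s))\big) \Big),

where $\hat{f}$ and $\hat{V}$ denote the learned dynamics model and the value function approximation, respectively.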


Acknowledgments

This work was partly supported by the CAPES grant #88887.339578/2019-00 (first author), by the joint FAPESP-IBM grant #2019/07665-4 (second and third authors), and by the CNPq PQx grant #304012/2019-0 (third author).

Author information


Corresponding author

Correspondence to Denis D. Mauá.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lovatto, Â.G., de Barros, L.N., Mauá, D.D. (2022). Exploration Versus Exploitation in Model-Based Reinforcement Learning: An Empirical Study. In: Xavier-Junior, J.C., Rios, R.A. (eds.) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science, vol. 13654. Springer, Cham. https://doi.org/10.1007/978-3-031-21689-3_3


  • DOI: https://doi.org/10.1007/978-3-031-21689-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21688-6

  • Online ISBN: 978-3-031-21689-3

  • eBook Packages: Computer Science, Computer Science (R0)
