Abstract
Learning a policy from sparse rewards is a central challenge in reinforcement learning (RL). The best solutions to this challenge have come from sample-inefficient model-free RL algorithms. Model-based RL algorithms are known to be sample efficient, but few of them can solve sparse-reward settings. To address these limitations, we present PlanQ, a sample-efficient model-based RL framework for sparse-reward settings. PlanQ leverages Q-values, which encode long-term values and serve as a richer feedback signal for actions than immediate rewards: it scores rollouts imagined with its learned model using returns that incorporate Q-values. We verify the efficacy of the approach on robot manipulation tasks ranging from simple to complex. Our experimental results show that PlanQ enhances performance and efficiency in sparse-reward settings.
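For concreteness, below is a minimal sketch of the rollout-scoring idea the abstract describes: short model-based rollouts whose returns are augmented with a learned Q-value, in the style of model-based value expansion. Everything here is an illustrative assumption, not PlanQ's actual implementation; the interfaces `model`, `q_fn`, and `policy`, the horizon, and the terminal-bootstrap form are hypothetical.

```python
def score_rollout(model, q_fn, policy, s0, horizon=5, gamma=0.99):
    """Score one imagined rollout with a Q-augmented return.

    Hypothetical interfaces (illustration only, not from the paper):
      model(s, a) -> (next_state, reward)   # learned dynamics model
      q_fn(s, a)  -> float                  # learned Q-value estimate
      policy(s)   -> action                 # action proposal
    """
    ret, s = 0.0, s0
    for t in range(horizon):
        a = policy(s)
        s, r = model(s, a)
        ret += (gamma ** t) * r  # sparse immediate rewards are often all zero
    # Bootstrap with the critic: the Q-value encodes long-term value beyond
    # the rollout horizon, a richer signal than the sparse rewards alone.
    a = policy(s)
    return ret + (gamma ** horizon) * q_fn(s, a)

# Toy usage with stand-in components (all hypothetical):
model = lambda s, a: (s + a, 0.0)      # dummy dynamics with zero (sparse) reward
q_fn = lambda s, a: -abs(10.0 - s)     # dummy critic: negative distance to a goal
policy = lambda s: 1.0                 # dummy constant action proposal
print(score_rollout(model, q_fn, policy, s0=0.0))
```

In a planner, many candidate rollouts would be scored this way and the best-scoring action (or a weighted combination) executed, so the critic can rank rollouts even when none of them reaches a reward.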
This work is supported by the Key Research & Development Program of Guangdong (2019B090915001), the Guangdong Yangfan Innovative & Entrepreneurial Program (2017YT05G026), the CUHK Direct Grant (4055141), the NSF of China (62176154), and the Hong Kong Centre for Logistics Robotics.
Notes
1. The simpler returns are computed by summing all the returns of one imagined rollout, without weighting them (written out after these notes).
2. Note that this goal transformation does not change the environment dynamics.
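Written out, one plausible reading of Note 1, with $r_t$ the model-predicted reward at step $t$ of an $H$-step imagined rollout, $\gamma$ the discount factor, and $Q$ the learned critic (the Q-augmented form on the right is an assumption in the style of the sketch above, not the paper's exact formula):

$$R_{\text{simple}} = \sum_{t=0}^{H-1} r_t, \qquad R_{Q} = \sum_{t=0}^{H-1} \gamma^{t}\, r_t + \gamma^{H}\, Q(s_H, a_H).$$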
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lei, H., Weng, P., Rojas, J., Guan, Y. (2022). Planning with Q-Values in Sparse Reward Reinforcement Learning. In: Liu, H., et al. (eds.) Intelligent Robotics and Applications. ICIRA 2022. Lecture Notes in Computer Science, vol. 13455. Springer, Cham. https://doi.org/10.1007/978-3-031-13844-7_56
DOI: https://doi.org/10.1007/978-3-031-13844-7_56
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13843-0
Online ISBN: 978-3-031-13844-7
eBook Packages: Computer Science, Computer Science (R0)