Abstract
The Q-learning algorithm suffers from overestimation bias due to the maximum operator appearing in its update rule. Other popular variants of Q-learning, such as double Q-learning, can instead underestimate the action values. In many stochastic environments, both underestimation and overestimation can lead to sub-optimal strategies. In this paper, we present a variation of Q-learning that uses elements from Monte Carlo reinforcement learning to correct for the overestimation bias. Our method (1) makes no assumptions about the distributions of the action values or the rewards, (2) does not require the extensive hyperparameter tuning of other popular variants proposed to deal with the overestimation bias, and (3) requires storing only two estimators, as in double Q-learning, along with the most recent episode. We show that our method effectively controls the overestimation bias in a number of simulated stochastic environments, yielding better policies with higher cumulative rewards and action values closer to the optimal ones than a number of well-established approaches.
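For context, the abstract contrasts the max-operator update of standard Q-learning with the decoupled update of double Q-learning. The sketch below illustrates those two baseline tabular update rules only; it is not the paper's Monte Carlo corrected method, and all names (Q, QA, QB, alpha, gamma) and the toy state/action sizes are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.95

# Standard Q-learning: the target takes max_a Q(s', a) over noisy
# estimates, which biases the target upward (overestimation).
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Double Q-learning: two estimators; one selects the argmax action,
# the other evaluates it. This removes the upward bias but can
# underestimate instead, as the abstract notes.
QA = np.zeros((n_states, n_actions))
QB = np.zeros((n_states, n_actions))

def double_q_update(s, a, r, s_next):
    if rng.random() < 0.5:
        a_star = QA[s_next].argmax()              # select with A ...
        target = r + gamma * QB[s_next, a_star]   # ... evaluate with B
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        a_star = QB[s_next].argmax()              # select with B ...
        target = r + gamma * QA[s_next, a_star]   # ... evaluate with A
        QB[s, a] += alpha * (target - QB[s, a])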
Cite this paper
Papadimitriou, D. (2023). Monte Carlo Bias Correction in Q-Learning. In: Goertzel, B., Iklé, M., Potapov, A., Ponomaryov, D. (eds.) Artificial General Intelligence. AGI 2022. Lecture Notes in Computer Science, vol. 13539. Springer, Cham. https://doi.org/10.1007/978-3-031-19907-3_33