Abstract
Deep Q-Networks (DQNs) have achieved great success, but the 𝜖-greedy exploration strategy adopted in the original DQN is inefficient at exploring the state-action space, so the model easily gets stuck at a poor local optimum. To address this issue, and inspired by the success of upper confidence bound (UCB) methods originally developed for bandit problems, we propose two more efficient exploration strategies that estimate confidence bounds on the future long-term returns. We adapt the proposed methods to high-dimensional DQNs by quantizing the state space and using a neural network to estimate the standard deviation. Experimental results show that our methods outperform the original DQN in both the total number of training iterations and the final score. Notably, of the 15 games we evaluate, our methods achieve the highest score in 13, and the score improves by up to 8.27x over UCB-Exploration. When our algorithm is applied to the widely used Rainbow algorithm, substantial improvements are also observed.
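To make the idea concrete, the following minimal sketch (ours, not the authors' code) shows how a DQN could replace 𝜖-greedy action selection with an upper confidence bound when a second network head estimates the standard deviation of the return. The network layout, the Softplus head, and the coefficient c are illustrative assumptions; the paper's actual architecture and bound estimator may differ.

import torch
import torch.nn as nn

class UCBQNet(nn.Module):
    """Q-network with a second head that predicts a per-action
    standard deviation of the return (illustrative layout)."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, n_actions)   # mean return estimate Q(s, a)
        self.sigma_head = nn.Sequential(             # std estimate sigma(s, a) >= 0
            nn.Linear(hidden, n_actions), nn.Softplus())

    def forward(self, state):
        h = self.body(state)
        return self.q_head(h), self.sigma_head(h)

def ucb_action(net, state, c=1.0):
    # Pick argmax_a [ Q(s,a) + c * sigma(s,a) ] instead of epsilon-greedy:
    # actions whose return estimate is still uncertain get an exploration bonus.
    with torch.no_grad():
        q, sigma = net(state.unsqueeze(0))
    return int(torch.argmax(q + c * sigma, dim=1))

# Example: a 4-dimensional toy state with two actions.
# net = UCBQNet(state_dim=4, n_actions=2)
# action = ucb_action(net, torch.zeros(4), c=2.0)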
References
Auer P, Ortner R (2006) Logarithmic online regret bounds for undiscounted reinforcement learning. In: Advances in neural information processing systems
Balaprakash P, Egele R, Salim M, Wild S, Vishwanath V, Xia F, Brettin T, Stevens R (2019) Scalable reinforcement-learning-based neural architecture search for cancer deep learning research. In: Proceedings of the international conference for high performance computing, networking, storage and analysis, pp 1–33
Barto AG (2013) Intrinsic motivation and reinforcement learning. Springer, Berlin, pp 17–47
Bellemare M, Srinivasan S, Ostrovski G, Schaul T, Saxton D, Munos R (2016) Unifying count-based exploration and intrinsic motivation. In: Advances in neural information processing systems, pp 1471–1479
Bellemare MG, Dabney W, Munos R (2017) A distributional perspective on reinforcement learning. In: Proceedings of the 34th international conference on machine learning, vol 70, pp 449–458. JMLR.org
Bellemare MG, Naddaf Y, Veness J, Bowling M (2013) The arcade learning environment: an evaluation platform for general agents. J Artif Intell Res 47(1):253–279
Castro PS, Moitra S, Gelada C, Kumar S, Bellemare MG (2018) Dopamine: A research framework for deep reinforcement learning. arXiv:1812.06110
Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. In: Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ (eds) Advances in neural information processing systems, vol 24. Curran Associates Inc, pp 2249–2257. http://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling.pdf
Chen G, Peng Y, Zhang M (2018) Effective exploration for deep reinforcement learning via bootstrapped Q-ensembles under Tsallis entropy regularization. arXiv:1809.00403
Chen RY, Sidor S, Abbeel P, Schulman J (2017) UCB exploration via Q-ensembles. arXiv:1706.01502
Dabney W, Ostrovski G, Silver D, Munos R (2018) Implicit quantile networks for distributional reinforcement learning. In: International conference on machine learning. PMLR, pp 1096–1105
Diuk C, Cohen A, Littman ML (2008) An object-oriented representation for efficient reinforcement learning. In: International conference on machine learning
Van Hasselt H, Guez A, Silver D (2015) Deep reinforcement learning with double Q-learning. Computer Science
He J, Chen J, He X, Gao J, Ostendorf M (2016) Deep reinforcement learning with a natural language action space. In: Proceedings of the 54th annual meeting of the association for computational linguistics (Volume 1: Long Papers)
Hessel M, Modayil J, Van Hasselt H, Schaul T, Ostrovski G, Dabney W, Horgan D, Piot B, Azar M, Silver D (2018) Rainbow: Combining improvements in deep reinforcement learning. In: Thirty-second AAAI conference on artificial intelligence
Horgan D, Quan J, Budden D, Barth-Maron G, Hessel M, van Hasselt H, Silver D (2018) Distributed prioritized experience replay
Jaksch T, Ortner R, Auer P (2008) Near-optimal regret bounds for reinforcement learning. In: Advances in neural information processing systems
Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems, vol 31. Curran Associates Inc, pp 4863–4873. http://papers.nips.cc/paper/7735-is-q-learning-provably-efficient.pdf
Lee K, Choi S, Oh S (2018) Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. IEEE Robot Autom Lett 3(3):1466–1473
Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2015) Continuous control with deep reinforcement learning. Comput Sci 8(6):A187
Martin J, Sasikumar SN, Everitt T, Hutter M (2017) Count-based exploration in feature space for reinforcement learning. arXiv:1706.08090
Obando-Ceron JS, Castro PS (2020) Revisiting Rainbow: Promoting more insightful and inclusive deep reinforcement learning research. arXiv:2011.14826
OpenAI (2019) OpenAI Five defeats Dota 2 world champions. https://openai.com/blog/openai-five-defeats-dota-2-world-champions/
Osband I, Blundell C, Pritzel A, Van Roy B (2016) Deep exploration via bootstrapped DQN. Adv Neural Inf Process Syst 29:4026–4034
Osband I, Russo D, Van Roy B (2013) (More) efficient reinforcement learning via posterior sampling. In: Advances in neural information processing systems, pp 3003–3011
Ostrovski G, Bellemare MG, van den Oord A, Munos R (2017) Count-based exploration with neural density models. In: Proceedings of the 34th international conference on machine learning, vol 70, pp 2721–2730. JMLR.org
Riedmiller M, Gabel T, Hafner R, Lange S (2009) Reinforcement learning for robot soccer. Auton Robot 27(1):55–73
Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math Oper Res 39(4):A95
Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z et al (2018) A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning 11(1):1–96
Schaul T, Quan J, Antonoglou I, Silver D (2015) Prioritized experience replay. arXiv:1511.05952
Schulman J, Levine S, Moritz P, Jordan MI, Abbeel P (2015) Trust region policy optimization. In: ICML
Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Driessche GVD, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M (2016) Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489
Strens M (2000) A Bayesian framework for reinforcement learning. In: Seventeenth international conference on machine learning
Tesauro G (1995) Temporal difference learning and TD-Gammon. Communications of the ACM 38(3):58–68
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Young K, Tian T (2019) MinAtar: An Atari-inspired testbed for thorough and reproducible reinforcement learning experiments. arXiv:1903.03176
Zambaldi V, Raposo D, Santoro A, Bapst V, Li Y, Babuschkin I, Tuyls K, Reichert D, Lillicrap T, Lockhart E et al (2018) Relational deep reinforcement learning. arXiv:1806.01830
Zoph B, Le QV (2016) Neural architecture search with reinforcement learning. arXiv:1611.01578
Acknowledgements
This research was supported by the Natural Science Foundation of China under Grant No. U1811464, in part by the Guangdong Natural Science Foundation under Grant No. 2018B030312002, in part by the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant No. 2016ZT06D211, in part by the CCF-Baidu Open Fund under Grant No. OF2021032, and in part by the National Natural Science Foundation of China (NSFC) under Grant No. 61806223.
Cite this article
Wen, Y., Su, Q., Shen, M. et al. Improving the exploration efficiency of DQNs via the confidence bound methods. Appl Intell 52, 15447–15461 (2022). https://doi.org/10.1007/s10489-022-03363-0