Abstract
Deep Q-Networks (DQNs) have achieved great success, but the 𝜖-greedy exploration strategy adopted in the original DQN is inefficient at exploring the state-action space, so the model easily gets stuck at a poor local optimum. To address this issue, and inspired by the success of upper confidence bound (UCB) methods originally developed for bandit problems, we propose two more efficient exploration strategies that estimate confidence bounds on the future long-term returns. We adapt the proposed methods to high-dimensional DQNs by quantizing the state space and using a neural network to estimate the standard deviation. Experimental results show that our methods outperform the original DQN in both the total number of training iterations and the final score. Notably, of the 15 games we evaluate, our methods achieve the highest score in 13, and the score improves by up to 8.27x over UCB-Exploration. When our algorithm is applied to the widely used Rainbow algorithm, substantial improvements are also observed.
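To make the idea concrete, the following minimal sketch (ours, not the authors' code) shows how a DQN could replace 𝜖-greedy action selection with an upper confidence bound when a second network head estimates the standard deviation of the return. The network layout, the Softplus head, and the coefficient c are illustrative assumptions; the paper's actual architecture and bound estimator may differ.

import torch
import torch.nn as nn

class UCBQNet(nn.Module):
    """Q-network with a second head that predicts a per-action
    standard deviation of the return (illustrative layout)."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, n_actions)   # mean return estimate Q(s, a)
        self.sigma_head = nn.Sequential(             # std estimate sigma(s, a) >= 0
            nn.Linear(hidden, n_actions), nn.Softplus())

    def forward(self, state):
        h = self.body(state)
        return self.q_head(h), self.sigma_head(h)

def ucb_action(net, state, c=1.0):
    # Pick argmax_a [ Q(s,a) + c * sigma(s,a) ] instead of epsilon-greedy:
    # actions whose return estimate is still uncertain get an exploration bonus.
    with torch.no_grad():
        q, sigma = net(state.unsqueeze(0))
    return int(torch.argmax(q + c * sigma, dim=1))

# Example: a 4-dimensional toy state with two actions.
# net = UCBQNet(state_dim=4, n_actions=2)
# action = ucb_action(net, torch.zeros(4), c=2.0)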
References
Auer P, Ortner R (2006) Logarithmic online regret bounds for undiscounted reinforcement learning. In: Advances in neural information processing systems
Balaprakash P, Egele R, Salim M, Wild S, Vishwanath V, Xia F, Brettin T, Stevens R (2019) Scalable reinforcement-learning-based neural architecture search for cancer deep learning research. In: Proceedings of the international conference for high performance computing, networking, storage and analysis, pp 1–33
Barto AG (2013) Intrinsic motivation and reinforcement learning. Springer, Berlin, pp 17–47
Bellemare M, Srinivasan S, Ostrovski G, Schaul T, Saxton D, Munos R (2016) Unifying count-based exploration and intrinsic motivation. In: Advances in neural information processing systems, pp 1471–1479
Bellemare MG, Dabney W, Munos R (2017) A distributional perspective on reinforcement learning. In: Proceedings of the 34th international conference on machine learning, vol 70, pp 449–458. JMLR.org
Bellemare MG, Naddaf Y, Veness J, Bowling M (2013) The arcade learning environment: an evaluation platform for general agents. J Artif Intell Res 47(1):253–279
Castro PS, Moitra S, Gelada C, Kumar S, Bellemare MG (2018) Dopamine: A research framework for deep reinforcement learning. arXiv:1812.06110
Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. In: Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ (eds) Advances in neural information processing systems, vol 24. Curran Associates Inc, pp 2249–2257. http://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling.pdf
Chen G, Peng Y, Zhang M (2018) Effective exploration for deep reinforcement learning via bootstrapped Q-ensembles under Tsallis entropy regularization. arXiv:1809.00403
Chen RY, Sidor S, Abbeel P, Schulman J (2017) UCB exploration via Q-ensembles. arXiv:1706.01502
Dabney W, Ostrovski G, Silver D, Munos R (2018) Implicit quantile networks for distributional reinforcement learning. In: International conference on machine learning. PMLR, pp 1096–1105
Diuk C, Cohen A, Littman ML (2008) An object-oriented representation for efficient reinforcement learning. In: International conference on machine learning
Van Hasselt H, Guez A, Silver D (2015) Deep reinforcement learning with double Q-learning. Computer Science
He J, Chen J, He X, Gao J, Ostendorf M (2016) Deep reinforcement learning with a natural language action space. In: Proceedings of the 54th annual meeting of the association for computational linguistics (Volume 1: Long Papers)
Hessel M, Modayil J, Van Hasselt H, Schaul T, Ostrovski G, Dabney W, Horgan D, Piot B, Azar M, Silver D (2018) Rainbow: Combining improvements in deep reinforcement learning. In: Thirty-second AAAI conference on artificial intelligence
Horgan D, Quan J, Budden D, Barth-Maron G, Hessel M, van Hasselt H, Silver D (2018) Distributed prioritized experience replay
Jaksch T, Ortner R, Auer P (2008) Near-optimal regret bounds for reinforcement learning. In: Advances in neural information processing systems
Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems, vol 31. Curran Associates Inc, pp 4863–4873. http://papers.nips.cc/paper/7735-is-q-learning-provably-efficient.pdf
Lee K, Choi S, Oh S (2018) Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. IEEE Robot Autom Lett 3(3):1466–1473
Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2015) Continuous control with deep reinforcement learning. Comput Sci 8(6):A187
Martin J, Sasikumar SN, Everitt T, Hutter M (2017) Count-based exploration in feature space for reinforcement learning. arXiv:1706.08090
Obando-Ceron JS, Castro PS (2020) Revisiting Rainbow: Promoting more insightful and inclusive deep reinforcement learning research. arXiv:2011.14826
OpenAI (2019) OpenAI Five defeats Dota 2 world champions. https://openai.com/blog/openai-five-defeats-dota-2-world-champions/
Osband I, Blundell C, Pritzel A, Van Roy B (2016) Deep exploration via bootstrapped DQN. Adv Neural Inf Process Syst 29:4026–4034
Osband I, Russo D, Van Roy B (2013) (More) efficient reinforcement learning via posterior sampling. In: Advances in neural information processing systems, pp 3003–3011
Ostrovski G, Bellemare MG, van den Oord A, Munos R (2017) Count-based exploration with neural density models. In: Proceedings of the 34th international conference on machine learning, vol 70, pp 2721–2730. JMLR.org
Riedmiller M, Gabel T, Hafner R, Lange S (2009) Reinforcement learning for robot soccer. Auton Robot 27(1):55–73
Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math Oper Res 39(4):A95
Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z et al (2018) A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning 11(1):1–96
Schaul T, Quan J, Antonoglou I, Silver D (2015) Prioritized experience replay. arXiv:1511.05952
Schulman J, Levine S, Moritz P, Jordan MI, Abbeel P (2015) Trust region policy optimization. In: ICML
Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Driessche GVD, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M (2016) Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489
Strens M (2000) A Bayesian framework for reinforcement learning. In: Seventeenth international conference on machine learning
Tesauro G (1995) Temporal difference learning and TD-Gammon. Communications of the ACM 38(3):58–68
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Young K, Tian T (2019) MinAtar: An Atari-inspired testbed for thorough and reproducible reinforcement learning experiments. arXiv:1903.03176
Zambaldi V, Raposo D, Santoro A, Bapst V, Li Y, Babuschkin I, Tuyls K, Reichert D, Lillicrap T, Lockhart E et al (2018) Relational deep reinforcement learning. arXiv:1806.01830
Zoph B, Le QV (2016) Neural architecture search with reinforcement learning. arXiv:1611.01578
Acknowledgements
This research was supported by the Natural Science Foundation of China under Grant No. U1811464, in part by the Guangdong Natural Science Foundation under Grant No. 2018B030312002, in part by the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant No. 2016ZT06D211, in part by the CCF-Baidu Open Fund under Grant No. OF2021032, and in part by the National Natural Science Foundation of China (NSFC) under Grant No. 61806223.
Cite this article
Wen, Y., Su, Q., Shen, M. et al. Improving the exploration efficiency of DQNs via the confidence bound methods. Appl Intell 52, 15447–15461 (2022). https://doi.org/10.1007/s10489-022-03363-0