
Improving the exploration efficiency of DQNs via the confidence bound methods


Abstract

Deep Q-Learning Networks (DQNs) have achieved great success, but the 𝜖-greedy exploration strategy adopted in the original DQN explores the state-action space inefficiently, so the model easily gets stuck in a poor local optimum. To address this issue, inspired by the success of upper confidence bound (UCB) methods, which primarily target bandit problems, we propose two more efficient exploration strategies that estimate confidence bounds of the future long-term returns. The proposed methods are adapted to high-dimensional DQNs by quantizing the state space and using a neural network to estimate the standard deviation of the return. Experimental results show that our proposed methods outperform the original DQN in terms of both the total number of iterations and the final score. Notably, our methods achieve the highest score in 13 of the 15 games we evaluate, improving the score by up to 8.27x over UCB-Exploration. When our algorithm is applied to the widely used Rainbow algorithm, substantial improvements are also observed.
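To make the idea concrete, the minimal sketch below (our illustration, not the authors' implementation) shows UCB-style action selection for a DQN agent: a second network head estimates the standard deviation of the return, and the agent acts greedily with respect to Q(s, a) + c · σ(s, a) instead of using 𝜖-greedy. All names (QSigmaNet, select_action_ucb, the bonus coefficient c, the softplus parameterization) are illustrative assumptions.

```python
# Sketch of confidence-bound action selection for a DQN-style agent.
# Assumption: one shared torso with a Q-value head and a std-dev head.
import torch
import torch.nn as nn


class QSigmaNet(nn.Module):
    """Shared torso with two heads: Q-value estimates and return std-devs."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, num_actions)      # Q(s, .)
        self.sigma_head = nn.Linear(hidden, num_actions)  # sigma(s, .)

    def forward(self, state: torch.Tensor):
        h = self.torso(state)
        q = self.q_head(h)
        # softplus keeps the estimated standard deviation non-negative
        sigma = nn.functional.softplus(self.sigma_head(h))
        return q, sigma


def select_action_ucb(net: QSigmaNet, state: torch.Tensor, c: float = 1.0) -> int:
    """Greedy action w.r.t. the upper confidence bound Q(s,a) + c * sigma(s,a)."""
    with torch.no_grad():
        q, sigma = net(state.unsqueeze(0))
        ucb = q + c * sigma
        return int(ucb.argmax(dim=1).item())


if __name__ == "__main__":
    net = QSigmaNet(state_dim=4, num_actions=3)
    s = torch.randn(4)
    print("chosen action:", select_action_ucb(net, s, c=1.0))
```

A larger bonus coefficient c favors actions whose returns are still uncertain, trading off exploitation against exploration; how the standard deviation is trained is a separate design choice not shown here.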


Notes

  1. https://github.com/google/dopamine

  2. https://github.com/JohanSamir/revisiting_rainbow


Acknowledgements

This research was supported by the Natural Science Foundation of China under Grant No. U1811464, and in part by the Guangdong Natural Science Foundation under Grant No. 2018B030312002, the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant No. 2016ZT06D211, the CCF-Baidu Open Fund under Grant No. OF2021032, and the National Natural Science Foundation of China (NSFC) under Grant No. 61806223.

Author information


Corresponding author

Correspondence to Yingpeng Wen.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wen, Y., Su, Q., Shen, M. et al. Improving the exploration efficiency of DQNs via the confidence bound methods. Appl Intell 52, 15447–15461 (2022). https://doi.org/10.1007/s10489-022-03363-0


