Efficient and stable deep reinforcement learning: selective priority timing entropy

Abstract

Deep reinforcement learning (DRL) has made significant strides in tasks with high-dimensional continuous action spaces. However, the field still faces the challenges of low sample utilization and an insufficient exploration-exploitation balance, which limit the generalizability of algorithms across environments. To improve sample utilization, optimize the exploration-exploitation balance, and achieve higher task rewards, this paper proposes the selective priority timing entropy (SPTE) algorithm. First, selective prioritized experience replay (SPER) replays multiframe memories more frequently to enhance sample utilization and stabilize policy updates. Second, the temporal advantage with decay (TAD) method introduces a decay factor that trades off variance against bias in advantage estimation, thereby reducing estimation error. Third, the reward mechanism is augmented with multientropy (ME) for entropy-regularized training, balancing information exploration and exploitation. Finally, experiments on the challenging Arcade platform show that SPTE exceeds the average score of human players by 104.936%. Compared with other algorithms, SPTE achieves an average score increase of over 32.75% and outperforms the compared methods in more than 60% of the tasks, indicating strong adaptability and robustness.

Graphical abstract

The implementation process of SPTE
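
The graphical abstract itself is not reproduced on this page; only its caption above survives. As an orientation aid, the sketch below is a minimal, hypothetical illustration of the three components the abstract names, assuming a priority-proportional sampler over multi-frame segments for SPER, a GAE-style decay factor for TAD, and a standard entropy bonus for ME. The names `SelectivePriorityBuffer`, `decayed_advantage`, and `entropy_regularized_loss`, and all hyperparameter values, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of the three components named in the abstract.
# Class/function names and hyperparameters are illustrative assumptions.

class SelectivePriorityBuffer:
    """Priority-proportional replay over short multi-frame segments (SPER-style, assumed)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha                 # how strongly priority shapes sampling
        self.segments, self.priorities = [], []

    def add(self, segment, td_error):
        # Store a multi-frame segment with a priority derived from its TD error.
        if len(self.segments) >= self.capacity:
            self.segments.pop(0)
            self.priorities.pop(0)
        self.segments.append(segment)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size):
        # Segments with larger TD error are replayed more frequently.
        p = np.asarray(self.priorities)
        p = p / p.sum()
        idx = np.random.choice(len(self.segments), size=batch_size, p=p)
        return [self.segments[i] for i in idx], idx


def decayed_advantage(rewards, values, gamma=0.99, lam=0.95):
    """GAE-style advantage with a decay factor trading variance against bias (TAD-style, assumed)."""
    adv = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                      # decay factor lam mixes multi-step terms
        adv[t] = gae
    return adv


def entropy_regularized_loss(policy_loss, action_probs, beta=0.01):
    """Policy loss minus an entropy bonus, so the objective also rewards exploration (ME-style, assumed)."""
    entropy = -np.sum(action_probs * np.log(action_probs + 1e-8), axis=-1).mean()
    return policy_loss - beta * entropy
```

With the decay factor `lam` near 0 the advantage estimate collapses to the one-step TD error (low variance, high bias); near 1 it approaches the Monte Carlo return (low bias, high variance), which is the variance-bias trade-off the abstract attributes to TAD.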

Data Availability

The datasets analysed during the current study are available in the OpenAI Gym library, https://www.gymlibrary.dev/.
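
The environments are not bundled with the article. As a hedged example of instantiating an Atari task from the Gym library linked above, the snippet below assumes a pre-0.26 Gym release (4-tuple `step()`), the Atari extras installed, and an arbitrary game ID; newer Gym/Gymnasium releases return `(obs, info)` from `reset()` and a 5-tuple from `step()`.

```python
# Minimal, illustrative use of the Gym library cited in the data-availability statement.
# The environment ID is an assumption; any installed ALE game ID would do.
import gym

env = gym.make("BreakoutNoFrameskip-v4")  # Atari extras: pip install "gym[atari,accept-rom-license]"
obs = env.reset()                         # Gym < 0.26 API; newer versions return (obs, info)
done = False
while not done:
    action = env.action_space.sample()    # random actions, just to exercise the interface
    obs, reward, done, info = env.step(action)
env.close()
```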

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62263017), in part by the Basic Research Program Project of Yunnan Province, China (Project No. 202301AU070059) and in part by the Kunming University of Science and Technology college level personnel training project (Project No. KKZ3202301041).

Author information

Contributions

Lin Huo: Conceived the innovative algorithm, contributed to the overall conceptualization of the paper, and played a key role in manuscript writing. Jianlin Mao: Designed the experimental methodology, contributed to the writing and editing of the manuscript, and provided financial support. Hongjun San: Conducted manuscript review and writing refinement, and provided financial support. Shufan Zhang: Conducted data analysis for the study. Ruiqi Li: Tested existing programs, organized and compiled the data. Lixia Fu: Maintained the research data.

Corresponding author

Correspondence to Jianlin Mao.

Ethics declarations

Competing Interests

The authors declare that there are no competing interests associated with this research.

Ethical and Informed Consent for Data Used

This research adheres to ethical standards, and all individuals involved in the study were fully informed and provided consent for the use of their data.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix. Parameter symbols

See Table 4.

Table 4 Parameter symbols

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Huo, L., Mao, J., San, H. et al. Efficient and stable deep reinforcement learning: selective priority timing entropy. Appl Intell 54, 10224–10241 (2024). https://doi.org/10.1007/s10489-024-05705-6
