Efficient and stable deep reinforcement learning: selective priority timing entropy

Abstract

Deep reinforcement learning (DRL) has made significant strides in tasks with high-dimensional continuous action spaces. However, the field still faces the challenges of low sample utilization and an insufficient exploration-exploitation balance, which limit the generalizability of algorithms across environments. To improve sample utilization, optimize the exploration-exploitation balance, and achieve higher task rewards, this paper proposes the selective priority timing entropy (SPTE) algorithm. First, selective prioritized experience replay (SPER) replays multiframe memories more frequently to enhance sample utilization and stabilize policy updates. Second, the temporal advantage with decay (TAD) method introduces a decay factor that trades off variance against bias in advantage estimation, thereby reducing estimation error. Third, the reward mechanism is augmented with multientropy (ME) for entropy-regularized training, balancing information exploration and exploitation. Finally, experiments on the challenging Arcade platform show that SPTE exceeds the average score of human players by 104.936%. Compared with other algorithms, SPTE achieves an average score increase of over 32.75% and outperforms the compared methods in more than 60% of the tasks, indicating strong adaptability and robustness.

Graphical abstract

The implementation process of SPTE
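
The graphical abstract itself is not reproduced on this page; only its caption above survives. As an orientation aid, the sketch below is a minimal, hypothetical illustration of the three components the abstract names, assuming a priority-proportional sampler over multi-frame segments for SPER, a GAE-style decay factor for TAD, and a standard entropy bonus for ME. The names `SelectivePriorityBuffer`, `decayed_advantage`, and `entropy_regularized_loss`, and all hyperparameter values, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of the three components named in the abstract.
# Class/function names and hyperparameters are illustrative assumptions.

class SelectivePriorityBuffer:
    """Priority-proportional replay over short multi-frame segments (SPER-style, assumed)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha                 # how strongly priority shapes sampling
        self.segments, self.priorities = [], []

    def add(self, segment, td_error):
        # Store a multi-frame segment with a priority derived from its TD error.
        if len(self.segments) >= self.capacity:
            self.segments.pop(0)
            self.priorities.pop(0)
        self.segments.append(segment)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size):
        # Segments with larger TD error are replayed more frequently.
        p = np.asarray(self.priorities)
        p = p / p.sum()
        idx = np.random.choice(len(self.segments), size=batch_size, p=p)
        return [self.segments[i] for i in idx], idx


def decayed_advantage(rewards, values, gamma=0.99, lam=0.95):
    """GAE-style advantage with a decay factor trading variance against bias (TAD-style, assumed)."""
    adv = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                      # decay factor lam mixes multi-step terms
        adv[t] = gae
    return adv


def entropy_regularized_loss(policy_loss, action_probs, beta=0.01):
    """Policy loss minus an entropy bonus, so the objective also rewards exploration (ME-style, assumed)."""
    entropy = -np.sum(action_probs * np.log(action_probs + 1e-8), axis=-1).mean()
    return policy_loss - beta * entropy
```

With the decay factor `lam` near 0 the advantage estimate collapses to the one-step TD error (low variance, high bias); near 1 it approaches the Monte Carlo return (low bias, high variance), which is the variance-bias trade-off the abstract attributes to TAD.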

Data Availability

The datasets analysed during the current study are available in the OpenAI Gym library, https://www.gymlibrary.dev/.
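
The environments are not bundled with the article. As a hedged example of instantiating an Atari task from the Gym library linked above, the snippet below assumes a pre-0.26 Gym release (4-tuple `step()`), the Atari extras installed, and an arbitrary game ID; newer Gym/Gymnasium releases return `(obs, info)` from `reset()` and a 5-tuple from `step()`.

```python
# Minimal, illustrative use of the Gym library cited in the data-availability statement.
# The environment ID is an assumption; any installed ALE game ID would do.
import gym

env = gym.make("BreakoutNoFrameskip-v4")  # Atari extras: pip install "gym[atari,accept-rom-license]"
obs = env.reset()                         # Gym < 0.26 API; newer versions return (obs, info)
done = False
while not done:
    action = env.action_space.sample()    # random actions, just to exercise the interface
    obs, reward, done, info = env.step(action)
env.close()
```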

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62263017), in part by the Basic Research Program Project of Yunnan Province, China (Project No. 202301AU070059) and in part by the Kunming University of Science and Technology college level personnel training project (Project No. KKZ3202301041).

Author information

Contributions

Lin Huo: Conceived the innovative algorithm, contributed to the overall conceptualization of the paper, and played a key role in manuscript writing. Jianlin Mao: Designed the experimental methodology, contributed to the writing and editing of the manuscript, and provided financial support. Hongjun San: Conducted manuscript review and writing refinement, and provided financial support. Shufan Zhang: Conducted data analysis for the study. Ruiqi Li: Tested existing programs, organized and compiled the data. Lixia Fu: Maintained the research data.

Corresponding author

Correspondence to Jianlin Mao.

Ethics declarations

Competing Interests

The authors declare that there are no competing interests associated with this research.

Ethical and Informed Consent for Data Used

This research adheres to ethical standards, and all individuals involved in the study were fully informed and provided consent for the use of their data.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix. Parameter symbols

See Table 4.

Table 4 Parameter symbols

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Huo, L., Mao, J., San, H. et al. Efficient and stable deep reinforcement learning: selective priority timing entropy. Appl Intell 54, 10224–10241 (2024). https://doi.org/10.1007/s10489-024-05705-6
