
A Novel Heuristic Exploration Method Based on Action Effectiveness Constraints to Relieve Loop Enhancement Effect in Reinforcement Learning with Sparse Rewards


Abstract

In realistic sparse reward tasks, existing theoretical methods cannot be effectively applied due to the low sampling probability of rewarded episodes. Extensive research on intrinsic-reward methods has been conducted to address this issue, but exploration with sparse rewards remains a great challenge. This paper describes the loop enhancement effect in exploration processes with sparse rewards: after each fully trained iteration, the execution probability of ineffective actions is higher than that of other suboptimal actions, which violates biological habitual behavior principles and is not conducive to effective training. This paper proposes corresponding theorems for relieving the loop enhancement effect in the exploration process with sparse rewards and a heuristic exploration method based on action effectiveness constraints (AEC), which improves policy training efficiency by relieving the loop enhancement effect. The method is inspired by the fact that animals form habitual behaviors and goal-directed behaviors through the dorsolateral striatum and dorsomedial striatum, respectively. The function of the dorsolateral striatum is simulated by an action effectiveness evaluation mechanism (A2EM), which aims to reduce the rate of ineffective samples and improve episode reward expectations. The function of the dorsomedial striatum is simulated by an agent policy network, which aims to achieve task goals. The iterative training of A2EM and the policy forms the AEC model structure: A2EM provides effective samples for the agent policy, and the agent policy provides training constraints for A2EM. The experimental results show that A2EM can relieve the loop enhancement effect and has good interpretability and generalizability. AEC enables agents to reduce the loop rate in samples, collect more effective samples, and improve the efficiency of policy training. The performance of AEC demonstrates the effectiveness of a biological heuristic approach that simulates the function of the dorsal striatum, and this approach can be used to improve the robustness of agent exploration with sparse rewards.
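The abstract outlines the AEC structure only at a high level, and the full algorithm is not included in this preview. As a rough illustration of the described loop, the sketch below pairs a toy action effectiveness estimator with a placeholder policy: the estimator constrains which actions the policy may try, and the policy's rollouts in turn update the estimator. Aside from the name A2EM, every class and function name, the state-change criterion for "effectiveness", and the thresholding scheme are assumptions made for this example and do not come from the paper.

# Illustrative sketch only. The abstract states that A2EM and the policy are
# trained iteratively, with A2EM supplying effective samples and the policy
# supplying training constraints for A2EM, but gives no implementation details.
import random
from collections import defaultdict

class A2EM:
    """Toy action-effectiveness estimator: tracks how often a (state, action)
    pair actually changed the state, as a loop-avoiding proxy for effectiveness."""

    def __init__(self):
        # (effective count, total count) with a weak optimistic prior
        self.counts = defaultdict(lambda: [1, 2])

    def update(self, state, action, next_state):
        eff, tot = self.counts[(state, action)]
        self.counts[(state, action)] = [eff + int(next_state != state), tot + 1]

    def effectiveness(self, state, action):
        eff, tot = self.counts[(state, action)]
        return eff / tot

    def constrain(self, state, actions, threshold=0.2):
        """Drop actions judged ineffective, but never leave an empty action set."""
        kept = [a for a in actions if self.effectiveness(state, a) >= threshold]
        return kept or list(actions)

def random_policy(state, allowed_actions):
    """Stand-in for the agent policy network: pick among the allowed actions."""
    return random.choice(allowed_actions)

def collect_episode(env_step, policy, a2em, start_state, actions, horizon=50):
    """One AEC-style iteration: A2EM constrains the actions the policy may try,
    and the resulting transitions are used to update A2EM in turn."""
    state, trajectory = start_state, []
    for _ in range(horizon):
        allowed = a2em.constrain(state, actions)
        action = policy(state, allowed)
        next_state, reward, done = env_step(state, action)
        a2em.update(state, action, next_state)   # policy rollouts supervise A2EM
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory                            # effective samples for policy training

# Toy corridor environment: moving "left" at position 0 is an ineffective
# (loop-producing) action; the reward is sparse and only given at the goal.
def corridor_step(state, action, goal=5):
    next_state = max(0, min(goal, state + (1 if action == "right" else -1)))
    return next_state, float(next_state == goal), next_state == goal

a2em = A2EM()
for _ in range(20):
    collect_episode(corridor_step, random_policy, a2em, 0, ["left", "right"])
print(round(a2em.effectiveness(0, "left"), 2), round(a2em.effectiveness(0, "right"), 2))

In the paper, A2EM is described as a mechanism trained iteratively with the agent's policy network; the table-based estimator and random policy above stand in for those components only to show how filtering ineffective actions suppresses loops in the collected samples.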




Data Availability

Not applicable.

Code Availability

Not applicable.


Funding

This work was supported by the National Natural Science Foundation of China (Key Program) (Grant number 51935005) and the Basic Scientific Research Projects of China (Grant number JCKY20200603C010), both of which provided research support to author Peng Liu, and by the Science and Technology Program Projects of Heilongjiang Province, China (Grant number GA21C031), which provided research support to author Ye Jin.

Author information


Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by ZN, YJ, PL, and WZ. The first draft of the manuscript was written by ZN and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ye Jin.

Ethics declarations

Ethics Approval

Not applicable.

Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Competing Interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ni, Z., Jin, Y., Liu, P. et al. A Novel Heuristic Exploration Method Based on Action Effectiveness Constraints to Relieve Loop Enhancement Effect in Reinforcement Learning with Sparse Rewards. Cogn Comput 16, 682–700 (2024). https://doi.org/10.1007/s12559-023-10226-4
