Abstract
Experience replay, which stores past samples for reuse, has become a fundamental component of off-policy reinforcement learning. Pioneering works have shown that prioritizing or reweighting samples by their on-policiness can yield significant performance improvements. However, this approach pays insufficient attention to sample diversity, which may cause instability or even long-term performance slumps. In this work, we introduce a novel Re-attention criterion that reevaluates recent experiences so that the agent can benefit from learning about them. We call the overall algorithm Re-attentive Experience Replay (RAER). RAER employs a parameter-insensitive dynamic testing technique to increase the attention paid to samples generated by policies whose overall performance shows a promising trend. By leveraging diverse samples judiciously, RAER retains the positive effects of on-policiness while avoiding its potential negative influences. Extensive experiments demonstrate that RAER improves both performance and stability. Moreover, replacing the on-policiness component of a state-of-the-art approach with RAER yields significant benefits.
Availability of data and materials
The datasets used in the experiments are all freely available. We provide the data sources in the references of the paper.
Code availability
The code is available at https://DkING-lv6.github.io/RAER/.
References
Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th international conference on machine learning (pp. 22–31).
Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., & Bellemare, M. (2021). Deep reinforcement learning at the edge of the statistical precipice. In Proceedings of the 35th conference on neural information processing systems (pp. 29304–29320).
Anschel, O., Baram, N., & Shimkin, N. (2017). Averaged-DQN: variance reduction and stabilization for deep reinforcement learning. In Proceedings of the 34th international conference on machine learning (pp. 176–185).
Csiszár, I. (1964). Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. A Magyar Tudományos Akadémia Matematikai Kutató Intézetének Közleményei, 8, 85–108.
Dasagi, V., Bruce, J., Peynot, T., & Leitner, J. (2019). Ctrl-Z: Recovering from instability in reinforcement learning. CoRR arXiv:1910.03732.
de Bruin, T., Kober, J., Tuyls, K., & Babuska, R. (2015). The importance of experience replay database composition in deep reinforcement learning. In Proceedings of the 29th conference on neural information processing systems.
Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., & Clune, J. (2021). First return, then explore. Nature, 590(7847), 580–586.
Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. In Proceedings of the 35th international conference on machine learning (pp. 1587–1596).
Fujimoto, S., Meger, D., & Precup, D. (2020). An equivalence between loss functions and non-uniform sampling in experience replay. In Proceedings of the 34th conference on neural information processing systems (pp. 14219–14230).
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th international conference on machine learning (pp. 1861–1870).
Han, S., & Sung, Y. (2021). A max-min entropy framework for reinforcement learning. In Proceedings of the 35th conference on neural information processing systems (pp. 25732–25745).
Hanna, J. P., Niekum, S., & Stone, P. (2021). Importance sampling in reinforcement learning with an estimated behavior policy. Machine Learning, 110(6), 1267–1317.
Hessel, M., Modayil, J., Hasselt, H. V., Schaul, T., Ostrovski, G., Dabney, W., et al. (2018). Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the 32nd AAAI conference on artificial intelligence (pp. 3215–3222).
Hsu, K.-C., Ren, A. Z., Nguyen, D. P., Majumdar, A., & Fisac, J. F. (2023). Sim-to-Lab-to-Real: Safe reinforcement learning with shielding and generalization guarantees. Artificial Intelligence, 314, 103811.
Kumar, A., Gupta, A., & Levine, S. (2020). DisCor: Corrective feedback in reinforcement learning via distribution correction. In Proceedings of the 34th conference on neural information processing systems (pp. 18560–18572).
Lee, K., Laskin, M., Srinivas, A., & Abbeel, P. (2021). Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning. In Proceedings of the 38th international conference on machine learning (pp. 6131–6141).
Lee, S., Seo, Y., Lee, K., Abbeel, P., & Shin, J. (2022). Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. In Proceedings of the fifth conference on robot learning (pp. 1702–1712).
Leottau, D. L., del Solar, J. R., & Babuška, R. (2018). Decentralized reinforcement learning of robot behaviors. Artificial Intelligence, 256, 130–159.
Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1), 1334–1373.
Lin, L. J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3–4), 293–321.
Liu, Q., Li, L., Tang, Z., & Zhou, D. (2018). Breaking the curse of horizon: infinite-horizon off-policy estimation. In Proceedings of the 32nd conference on neural information processing systems (pp. 5356–5366).
Liu, X., Xue, Z., Pang, J., Jiang, S., Xu, F., & Yu, Y. (2021). Regret minimization experience replay in off-policy reinforcement learning. In Proceedings of the 35th conference on neural information processing systems (pp. 17604–17615).
Mavor-Parker, A., Young, K., Barry, C., & Griffin, L. (2022). How to stay curious while avoiding noisy TVs using aleatoric uncertainty estimation. In Proceedings of the 39th international conference on machine learning (pp. 15220–15240).
McKnight, P. E., & Najab, J. (2010). Mann–Whitney U test. The Corsini Encyclopedia of Psychology, 1–1.
Miki, T., Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., & Hutter, M. (2022). Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics, 7(62).
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
Munos, R., Stepleton, T., Harutyunyan, A., & Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. In Proceedings of the 30th conference on neural information processing systems (pp. 1054–1062).
Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11), 5847–5861.
Novati, G., & Koumoutsakos, P. (2019). Remember and forget for experience replay. In Proceedings of the 36th international conference on machine learning (pp. 4851–4860).
Oh, J., Guo, Y., Singh, S., & Lee, H. (2018). Self-imitation learning. In Proceedings of the 35th international conference on machine learning (pp. 3878–3887).
Osband, I., Blundell, C., Pritzel, A., & Roy, B.V. (2016). Deep exploration via bootstrapped DQN. In Proceedings of the 30th conference on neural information processing systems (pp. 4033–4041).
Precup, D., Sutton, R. S., & Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the seventeenth international conference on machine learning (pp. 759–766).
Riedmiller, M., Springenberg, J. T., Hafner, R., & Heess, N. (2022). Collect & Infer: A fresh look at data-efficient reinforcement learning. In Proceedings of the fifth conference on robot learning (pp. 1736–1744).
Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. In Proceedings of the fourth international conference on learning representations.
Schlegel, M., Chung, W., Graves, D., Qian, J., & White, M. (2019). Importance resampling for off-policy prediction. In Proceedings of the 33rd conference on neural information processing systems (pp. 1797–1807).
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., et al. (2020). Mastering Atari, go, chess and shogi by planning with a learned model. Nature, 588(7839), 604–609.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In Proceedings of the 32nd international conference on machine learning (pp. 1889–1897).
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. CoRR arXiv:1707.06347.
Sinha, S., Song, J., Garg, A., & Ermon, S. (2022). Experience replay with likelihood-free importance weights. In Proceedings of the fourth annual learning for dynamics and control conference (pp. 110–123).
Sootla, A., Cowen-Rivers, A. I., Jafferjee, T., Wang, Z., Mguni, D. H., Wang, J., & Ammar, H. (2022). Sauté RL: Almost surely safe reinforcement learning using state augmentation. In Proceedings of the 39th international conference on machine learning (pp. 20423–20443).
Sun, P., Zhou, W., & Li, H. (2020). Attentive experience replay. In Proceedings of the 34th AAAI conference on artificial intelligence (pp. 5900–5907).
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. London: MIT Press.
Todorov, E., Erez, T., & Tassa, Y. (2012). Mujoco: A physics engine for model-based control. In Proceedings of the 24th international conference on intelligent robots and systems (pp. 5026–5033).
van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double q-learning. In Proceedings of the 30th AAAI conference on artificial intelligence (pp. 2094–2100).
Wang, C., Wu, Y., Vuong, Q., & Ross, K. (2020a). Striving for simplicity and performance in off-policy DRL: Output normalization and non-uniform sampling. In Proceedings of the 37th international conference on machine learning (pp. 10070–10080).
Wang, J., Wang, X., Luo, X., Zhang, Z., Wang, W., & Li, Y. (2020b). Sem: Adaptive staged experience access mechanism for reinforcement learning. In Proceedings of the 32nd international conference on tools with artificial intelligence (pp. 1088–1095).
Wu, D., Dong, X., Shen, J., & Hoi, S. C. (2020). Reducing estimation bias via triplet-average deep deterministic policy gradient. IEEE Transactions on Neural Networks and Learning Systems, 31(11), 4933–4945.
Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021). Mastering Atari games with limited data. In Proceedings of the 35th conference on neural information processing systems (pp. 25476–25488).
Yu, Y. (2018). Towards sample efficient reinforcement learning. In Proceedings of the 27th international joint conference on artificial intelligence (pp. 5739–5743).
Yuan, M., Pun, M.-O., & Wang, D. (2022). Rényi state entropy maximization for exploration acceleration in reinforcement learning. Artificial Intelligence, 1(1), 1–11.
Zha, D., Lai, K.-H., Zhou, K., & Hu, X. (2019). Experience replay optimization. In Proceedings of the 28th international joint conference on artificial intelligence (pp. 4243–4249).
Zhang, L., Zhang, Z., Pan, Z., Chen, Y., Zhu, J., Wang, Z., et al. (2019). A framework of dual replay buffer: balancing forgetting and generalization in reinforcement learning. In Proceedings of the 2nd workshop on scaling up reinforcement learning (SURL) at the international joint conference on artificial intelligence (IJCAI).
Zhang, T., Rashidinejad, P., Jiao, J., Tian, Y., Gonzalez, J. E., & Russell, S. (2021). Made: Exploration via maximizing deviation from explored regions. In Proceedings of the 35th conference on neural information processing systems (pp. 9663–9680).
Acknowledgements
We would like to thank the anonymous reviewers for their very constructive comments. This work was supported by the National Key Research and Development Program of China (2020AAA0106100), the National Natural Science Foundation of China (62276160), and the Natural Science Foundation of Shanxi Province, China (202203021211294).
Funding
This work was supported by the National Key Research and Development Program of China (2020AAA0106100), the National Natural Science Foundation of China (62276160), and the Natural Science Foundation of Shanxi Province, China (202203021211294).
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Conceptualization: WW, DW, and JL; methodology: WW and DW; formal analysis and investigation: WW, DW, and LL; software and validation: DW and LL; writing—original draft: WW and DW; writing—review and editing: WW, DW, LL, and JL; propositional guidance: JL; supervision: JL; funding acquisition: JL, WW, and LL. All authors discussed the results and contributed to the final manuscript, helping with writing, reviewing and editing. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable. All the experiments in this paper are computer simulations of games and do not involve experiments on animals, plants, or human entities.
Consent to participate
Not applicable.
Consent for publication
Not applicable. The paper does not include data or images that require permissions to be published.
Additional information
Editor: Javier Garcia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Implementation details
The detailed parameter settings are listed in Table 1. The algorithms AER, SEM, ReMERT, LFIW, and RAER/RAER\(^\dagger\) (ours) all share the full set of Agent and SAC parameters.
The replay buffer size \(\left| {\mathcal {D}}_{\textrm{f}}\right|\) of LFIW determines how many experiences we treat as "on-policy". According to prior experience with LFIW, performance is relatively stable for \(\left| {\mathcal {D}}_{\textrm{f}}\right| = 1\times 10^{5}\). The hidden network sizes of \(\kappa _{\psi }\) are [128, 128], and the temperature hyperparameter \(T\) used to self-normalize the importance weights is 7.5. The normalization is:
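The normalization formula itself is not reproduced above. As a sketch only, under the assumption that LFIW-style weights are flattened by a temperature exponent and then self-normalized over the batch (the function name `self_normalize` and the batch-mean convention are our illustrative choices, not the paper's exact formula):

```python
def self_normalize(weights, T=7.5):
    """Temperature-scaled self-normalization of importance weights.

    Each raw weight w_i is raised to the power 1/T, which flattens the
    weight distribution for large T (reducing variance at the cost of
    bias), and the result is divided by the batch mean so the returned
    weights average to 1. This is one common convention; the paper's
    exact normalization is not shown here.
    """
    scaled = [w ** (1.0 / T) for w in weights]
    mean = sum(scaled) / len(scaled)
    return [w / mean for w in scaled]
```

With \(T = 1\) the raw ratios between weights are preserved exactly; as \(T\) grows, all weights are pushed toward 1 and the sampling becomes closer to uniform.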
ReMERT and RAER\(^\dagger\) use the same parameters as LFIW when calculating the likelihood-free importance weights.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wei, W., Wang, D., Li, L. et al. Re-attentive experience replay in off-policy reinforcement learning. Mach Learn 113, 2327–2349 (2024). https://doi.org/10.1007/s10994-023-06505-8