Re-attentive experience replay in off-policy reinforcement learning

Abstract

Experience replay, which stores past samples for reuse, has become a fundamental component of off-policy reinforcement learning. Some pioneering works have indicated that prioritizing or reweighting samples according to their on-policiness can yield significant performance improvements. However, this approach does not pay enough attention to sample diversity, which may result in instability or even long-term performance slumps. In this work, we introduce a novel Re-attention criterion that reevaluates recent experiences so that the agent can benefit from learning about them. We call this overall algorithm Re-attentive Experience Replay (RAER). RAER employs a parameter-insensitive dynamic testing technique to enhance the attention given to samples generated by policies with promising trends in overall performance. By wisely leveraging diverse samples, RAER realizes the positive effects of on-policiness while avoiding its potential negative influences. Extensive experiments demonstrate the effectiveness of RAER in improving both performance and stability. Moreover, replacing the on-policiness component of the state-of-the-art approach with RAER can yield significant benefits.

Availability of data and materials

The datasets used in the experiments are all freely available. The data sources are provided in the references of the paper.

Code availability

The code is available at https://DkING-lv6.github.io/RAER/.

Notes

  1. https://DkING-lv6.github.io/RAER/.

References

  • Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th international conference on machine learning (pp. 22–31).

  • Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., & Bellemare, M. (2021). Deep reinforcement learning at the edge of the statistical precipice. In Proceedings of the 35th conference on neural information processing systems (pp. 29304–29320).

  • Anschel, O., Baram, N., & Shimkin, N. (2017). Averaged-DQN: variance reduction and stabilization for deep reinforcement learning. In Proceedings of the 34th international conference on machine learning (pp. 176–185).

  • Csiszár, I. (1964). Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akadémia Matematikai Kutató Intézetének Közleményei, 8, 85–108.

  • Dasagi, V., Bruce, J., Peynot, T., & Leitner, J. (2019). Ctrl-z: recovering from instability in reinforcement learning. CoRR arXiv:1910.03732.

  • de Bruin, T., Kober, J., Tuyls, K., & Babuska, R. (2015). The importance of experience replay database composition in deep reinforcement learning. In Proceedings of the 29th conference on neural information processing systems.

  • Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., & Clune, J. (2021). First return, then explore. Nature, 590(7847), 580–586.

  • Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. In Proceedings of the 35th international conference on machine learning (pp. 1587–1596).

  • Fujimoto, S., Meger, D., & Precup, D. (2020). An equivalence between loss functions and non-uniform sampling in experience replay. In Proceedings of the 34th conference on neural information processing systems (pp. 14219–14230).

  • Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th international conference on machine learning (pp. 1861–1870).

  • Han, S., & Sung, Y. (2021). A max-min entropy framework for reinforcement learning. In Proceedings of the 35th conference on neural information processing systems (pp. 25732–25745).

  • Hanna, J. P., Niekum, S., & Stone, P. (2021). Importance sampling in reinforcement learning with an estimated behavior policy. Machine Learning, 110(6), 1267–1317.

  • Hessel, M., Modayil, J., Hasselt, H. V., Schaul, T., Ostrovski, G., Dabney, W., et al. (2018). Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the 32nd AAAI conference on artificial intelligence (pp. 3215–3222).

  • Hsu, K.-C., Ren, A. Z., Nguyen, D. P., Majumdar, A., & Fisac, J. F. (2023). Sim-to-lab-to-real: Safe reinforcement learning with shielding and generalization guarantees. Artificial Intelligence, 314, 103811.

  • Kumar, A., Gupta, A., & Levine, S. (2020). DisCor: Corrective feedback in reinforcement learning via distribution correction. In Proceedings of the 34th conference on neural information processing systems (pp. 18560–18572).

  • Lee, K., Laskin, M., Srinivas, A., & Abbeel, P. (2021). Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning. In Proceedings of the 38th international conference on machine learning (pp. 6131–6141).

  • Lee, S., Seo, Y., Lee, K., Abbeel, P., & Shin, J. (2022). Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. In Proceedings of the fifth conference on robot learning (pp. 1702–1712).

  • Leottau, D. L., del Solar, J. R., & Babuška, R. (2018). Decentralized reinforcement learning of robot behaviors. Artificial Intelligence, 256, 130–159.

  • Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1), 1334–1373.

  • Lin, L. J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3–4), 293–321.

  • Liu, Q., Li, L., Tang, Z., & Zhou, D. (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Proceedings of the 32nd conference on neural information processing systems (pp. 5356–5366).

  • Liu, X., Xue, Z., Pang, J., Jiang, S., Xu, F., & Yu, Y. (2021). Regret minimization experience replay in off-policy reinforcement learning. In Proceedings of the 35th conference on neural information processing systems (pp. 17604–17615).

  • Mavor-Parker, A., Young, K., Barry, C., & Griffin, L. (2022). How to stay curious while avoiding noisy TVs using aleatoric uncertainty estimation. In Proceedings of the 39th international conference on machine learning (pp. 15220–15240).

  • McKnight, P. E., & Najab, J. (2010). Mann–Whitney U test. The Corsini Encyclopedia of Psychology, 1–1.

  • Miki, T., Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., & Hutter, M. (2022). Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics, 7(62).

  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.

  • Munos, R., Stepleton, T., Harutyunyan, A., & Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. In Proceedings of the 30th conference on neural information processing systems (pp. 1054–1062).

  • Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11), 5847–5861.

  • Novati, G., & Koumoutsakos, P. (2019). Remember and forget for experience replay. In Proceedings of the 36th international conference on machine learning (pp. 4851–4860).

  • Oh, J., Guo, Y., Singh, S., & Lee, H. (2018). Self-imitation learning. In Proceedings of the 35th international conference on machine learning (pp. 3878–3887).

  • Osband, I., Blundell, C., Pritzel, A., & Roy, B.V. (2016). Deep exploration via bootstrapped DQN. In Proceedings of the 30th conference on neural information processing systems (pp. 4033–4041).

  • Precup, D., Sutton, R. S., & Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the seventeenth international conference on machine learning (pp. 759–766).

  • Riedmiller, M., Springenberg, J. T., Hafner, R., & Heess, N. (2022). Collect & Infer - a fresh look at data-efficient reinforcement learning. In Proceedings of the fifth conference on robot learning (pp. 1736–1744).

  • Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. In Proceedings of the fourth international conference on learning representations.

  • Schlegel, M., Chung, W., Graves, D., Qian, J., & White, M. (2019). Importance resampling for off-policy prediction. In Proceedings of the 33rd conference on neural information processing systems (pp. 1797–1807).

  • Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., et al. (2020). Mastering Atari, go, chess and shogi by planning with a learned model. Nature, 588(7839), 604–609.

  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In Proceedings of the 32nd international conference on machine learning (pp. 1889–1897).

  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. CoRR arXiv:1707.06347.

  • Sinha, S., Song, J., Garg, A., & Ermon, S. (2022). Experience replay with likelihood-free importance weights. In Proceedings of the fourth annual learning for dynamics and control conference (pp. 110–123).

  • Sootla, A., Cowen-Rivers, A. I., Jafferjee, T., Wang, Z., Mguni, D. H., Wang, J., & Ammar, H. (2022). Sauté RL: Almost surely safe reinforcement learning using state augmentation. In Proceedings of the 39th international conference on machine learning (pp. 20423–20443).

  • Sun, P., Zhou, W., & Li, H. (2020). Attentive experience replay. In Proceedings of the 34th AAAI conference on artificial intelligence (pp. 5900–5907).

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. London: MIT Press.

  • Todorov, E., Erez, T., & Tassa, Y. (2012). Mujoco: A physics engine for model-based control. In Proceedings of the 24th international conference on intelligent robots and systems (pp. 5026–5033).

  • van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double q-learning. In Proceedings of the 30th AAAI conference on artificial intelligence (pp. 2094–2100).

  • Wang, C., Wu, Y., Vuong, Q., & Ross, K. (2020a). Striving for simplicity and performance in off-policy DRL: Output normalization and non-uniform sampling. In Proceedings of the 37th international conference on machine learning (pp. 10070–10080).

  • Wang, J., Wang, X., Luo, X., Zhang, Z., Wang, W., & Li, Y. (2020b). Sem: Adaptive staged experience access mechanism for reinforcement learning. In Proceedings of the 32nd international conference on tools with artificial intelligence (pp. 1088–1095).

  • Wu, D., Dong, X., Shen, J., & Hoi, S. C. (2020). Reducing estimation bias via triplet-average deep deterministic policy gradient. IEEE Transactions on Neural Networks and Learning Systems, 31(11), 4933–4945.

  • Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021). Mastering Atari games with limited data. In Proceedings of the 35th conference on neural information processing systems (pp. 25476–25488).

  • Yu, Y. (2018). Towards sample efficient reinforcement learning. In Proceedings of the 27th international joint conference on artificial intelligence (pp. 5739–5743).

  • Yuan, M., Pun, M.-O., & Wang, D. (2022). Rényi state entropy maximization for exploration acceleration in reinforcement learning. Artificial Intelligence, 1(1), 1–11.

  • Zha, D., Lai, K.-H., Zhou, K., & Hu, X. (2019). Experience replay optimization. In Proceedings of the 28th international joint conference on artificial intelligence (pp. 4243–4249).

  • Zhang, L., Zhang, Z., Pan, Z., Chen, Y., Zhu, J., Wang, Z., et al. (2019). A framework of dual replay buffer: balancing forgetting and generalization in reinforcement learning. In Proceedings of the 2nd workshop on scaling up reinforcement learning (SURL) international joint conference on artificial intelligence (IJCAI).

  • Zhang, T., Rashidinejad, P., Jiao, J., Tian, Y., Gonzalez, J. E., & Russell, S. (2021). Made: Exploration via maximizing deviation from explored regions. In Proceedings of the 35th conference on neural information processing systems (pp. 9663–9680).

Acknowledgements

We would like to thank the anonymous reviewers for their very constructive comments.

Funding

This work was supported by the National Key Research and Development Program of China (2020AAA0106100), the National Natural Science Foundation of China (62276160), and the Natural Science Foundation of Shanxi Province, China (202203021211294).

Author information

Contributions

All authors contributed to the study conception and design. Conceptualization: WW, DW, and JL; methodology: WW and DW; formal analysis and investigation: WW, DW, and LL; software and validation: DW and LL; writing—original draft: WW and DW; writing—review and editing: WW, DW, LL, and JL; propositional guidance: JL; supervision: JL; funding acquisition: JL, WW, and LL. All authors discussed the results and contributed to the final manuscript, helping with writing, reviewing and editing. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jiye Liang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable. All experiments in this paper are computer simulations of games and do not involve animals, plants, or human subjects.

Consent to participate

Not applicable.

Consent for publication

Not applicable. The paper does not include data or images that require permissions to be published.

Additional information

Editor: Javier Garcia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Implementation details

Table 1 Hyperparameters for continuous control tasks

The detailed parameter settings are listed in Table 1. The algorithms AER, SEM, ReMERT, LFIW, and RAER/RAER\(^\dagger\) (ours) all use the full set of Agent and SAC parameters.

The replay buffer size \(\left| {\mathcal {D}}_{\textrm{f}}\right|\) of LFIW determines how many experiences are treated as on-policy. According to prior experience with LFIW, performance is relatively stable for \(\left| {\mathcal {D}}_{\textrm{f}}\right| = 1\times 10^{5}\). The hidden layer sizes of \(\kappa _{\psi }\) are [128, 128], and the temperature hyperparameter T used to self-normalize the importance weights is 7.5. The normalization is:

$$\begin{aligned} {\tilde{\kappa }}_{\psi }(s, a):=\frac{\kappa _{\psi }(s, a)^{1 / T}}{{\mathbb {E}}_{{\mathcal {D}}_{\textrm{s}}}\left[ \kappa _{\psi }(s, a)^{1 / T}\right] }. \end{aligned}$$
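
For concreteness, the following is a minimal PyTorch sketch of how this self-normalization could be computed. The class KappaNetwork, the Softplus output layer, and the batch-mean estimate of the expectation over \({\mathcal {D}}_{\textrm{s}}\) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class KappaNetwork(nn.Module):
    """Density-ratio network kappa_psi(s, a) with hidden sizes [128, 128] (Table 1)."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Softplus(),  # keep the estimated ratio non-negative
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)


def self_normalized_weights(kappa: KappaNetwork,
                            states: torch.Tensor,
                            actions: torch.Tensor,
                            T: float = 7.5) -> torch.Tensor:
    """Temperature-smoothed, self-normalized importance weights.

    Computes kappa_tilde(s, a) = kappa(s, a)^(1/T) / E[kappa(s, a)^(1/T)],
    where the mean over the sampled batch stands in for the expectation
    over D_s in the equation above.
    """
    with torch.no_grad():
        w = kappa(states, actions).clamp(min=1e-8).pow(1.0 / T)
        return w / w.mean()
```

In LFIW (Sinha et al., 2022), weights of this form reweight the per-sample Bellman error during critic updates; a larger temperature T flattens the weights toward uniform sampling.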

ReMERT and RAER\(^\dagger\) use the same parameters as LFIW when computing the likelihood-free importance weights.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wei, W., Wang, D., Li, L. et al. Re-attentive experience replay in off-policy reinforcement learning. Mach Learn 113, 2327–2349 (2024). https://doi.org/10.1007/s10994-023-06505-8
