Abstract
Experience replay, which stores past samples for reuse, has become a fundamental component of off-policy reinforcement learning. Pioneering works have shown that prioritizing or reweighting samples by their on-policiness can yield significant performance improvements. However, this approach pays insufficient attention to sample diversity, which may cause instability or even long-term performance slumps. In this work, we introduce a novel Re-attention criterion that reevaluates recent experiences so that the agent can benefit from learning about them. We call the overall algorithm Re-attentive Experience Replay (RAER). RAER employs a parameter-insensitive dynamic testing technique to increase the attention paid to samples generated by policies whose overall performance shows a promising trend. By leveraging diverse samples judiciously, RAER retains the positive effects of on-policiness while avoiding its potential negative influences. Extensive experiments demonstrate that RAER improves both performance and stability. Moreover, replacing the on-policiness component of a state-of-the-art approach with RAER yields significant benefits.
Availability of data and materials
The datasets used in the experiments are all freely available. We provide the data sources in the references of the paper.
Code availability
The code is available at https://DkING-lv6.github.io/RAER/.
References
Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th international conference on machine learning (pp. 22–31).
Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., & Bellemare, M. (2021). Deep reinforcement learning at the edge of the statistical precipice. In Proceedings of the 35th conference on neural information processing systems (pp. 29304–29320).
Anschel, O., Baram, N., & Shimkin, N. (2017). Averaged-DQN: variance reduction and stabilization for deep reinforcement learning. In Proceedings of the 34th international conference on machine learning (pp. 176–185).
Csiszár, I. (1964). Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. A Magyar Tudományos Akadémia Matematikai Kutató Intézetének Közleményei, 8, 85–108.
Dasagi, V., Bruce, J., Peynot, T., & Leitner, J. (2019). Ctrl-Z: Recovering from instability in reinforcement learning. CoRR arXiv:1910.03732.
de Bruin, T., Kober, J., Tuyls, K., & Babuska, R. (2015). The importance of experience replay database composition in deep reinforcement learning. In Proceedings of the 29th conference on neural information processing systems.
Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., & Clune, J. (2021). First return, then explore. Nature, 590(7847), 580–586.
Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. In Proceedings of the 35th international conference on machine learning (pp. 1587–1596).
Fujimoto, S., Meger, D., & Precup, D. (2020). An equivalence between loss functions and non-uniform sampling in experience replay. In Proceedings of the 34th conference on neural information processing systems (pp. 14219–14230).
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th international conference on machine learning (pp. 1861–1870).
Han, S., & Sung, Y. (2021). A max-min entropy framework for reinforcement learning. In Proceedings of the 35th conference on neural information processing systems (pp. 25732–25745).
Hanna, J. P., Niekum, S., & Stone, P. (2021). Importance sampling in reinforcement learning with an estimated behavior policy. Machine Learning, 110(6), 1267–1317.
Hessel, M., Modayil, J., Hasselt, H. V., Schaul, T., Ostrovski, G., Dabney, W., et al. (2018). Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the 32nd AAAI conference on artificial intelligence (pp. 3215–3222).
Hsu, K.-C., Ren, A. Z., Nguyen, D. P., Majumdar, A., & Fisac, J. F. (2023). Sim-to-Lab-to-Real: Safe reinforcement learning with shielding and generalization guarantees. Artificial Intelligence, 314, 103811.
Kumar, A., Gupta, A., & Levine, S. (2020). DisCor: Corrective feedback in reinforcement learning via distribution correction. In Proceedings of the 34th conference on neural information processing systems (pp. 18560–18572).
Lee, K., Laskin, M., Srinivas, A., & Abbeel, P. (2021). Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning. In Proceedings of the 38th international conference on machine learning (pp. 6131–6141).
Lee, S., Seo, Y., Lee, K., Abbeel, P., & Shin, J. (2022). Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. In Proceedings of the fifth conference on robot learning (pp. 1702–1712).
Leottau, D. L., del Solar, J. R., & Babuška, R. (2018). Decentralized reinforcement learning of robot behaviors. Artificial Intelligence, 256, 130–159.
Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1), 1334–1373.
Lin, L. J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3–4), 293–321.
Liu, Q., Li, L., Tang, Z., & Zhou, D. (2018). Breaking the curse of horizon: infinite-horizon off-policy estimation. In Proceedings of the 32nd conference on neural information processing systems (pp. 5356–5366).
Liu, X., Xue, Z., Pang, J., Jiang, S., Xu, F., & Yu, Y. (2021). Regret minimization experience replay in off-policy reinforcement learning. In Proceedings of the 35th conference on neural information processing systems (pp. 17604–17615).
Mavor-Parker, A., Young, K., Barry, C., & Griffin, L. (2022). How to stay curious while avoiding noisy TVs using aleatoric uncertainty estimation. In Proceedings of the 39th international conference on machine learning (pp. 15220–15240).
McKnight, P. E., & Najab, J. (2010). Mann–Whitney U test. The Corsini Encyclopedia of Psychology, 1–1.
Miki, T., Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., & Hutter, M. (2022). Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics, 7(62).
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
Munos, R., Stepleton, T., Harutyunyan, A., & Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. In Proceedings of the 30th conference on neural information processing systems (pp. 1054–1062).
Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11), 5847–5861.
Novati, G., & Koumoutsakos, P. (2019). Remember and forget for experience replay. In Proceedings of the 36th international conference on machine learning (pp. 4851–4860).
Oh, J., Guo, Y., Singh, S., & Lee, H. (2018). Self-imitation learning. In Proceedings of the 35th international conference on machine learning (pp. 3878–3887).
Osband, I., Blundell, C., Pritzel, A., & Roy, B.V. (2016). Deep exploration via bootstrapped DQN. In Proceedings of the 30th conference on neural information processing systems (pp. 4033–4041).
Precup, D., Sutton, R. S., & Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the seventeenth international conference on machine learning (pp. 759–766).
Riedmiller, M., Springenberg, J. T., Hafner, R., & Heess, N. (2022). Collect & Infer: A fresh look at data-efficient reinforcement learning. In Proceedings of the fifth conference on robot learning (pp. 1736–1744).
Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. In Proceedings of the fourth international conference on learning representations.
Schlegel, M., Chung, W., Graves, D., Qian, J., & White, M. (2019). Importance resampling for off-policy prediction. In Proceedings of the 33rd conference on neural information processing systems (pp. 1797–1807).
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., et al. (2020). Mastering Atari, go, chess and shogi by planning with a learned model. Nature, 588(7839), 604–609.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In Proceedings of the 32nd international conference on machine learning (pp. 1889–1897).
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. CoRR arXiv:1707.06347.
Sinha, S., Song, J., Garg, A., & Ermon, S. (2022). Experience replay with likelihood-free importance weights. In Proceedings of the fourth annual learning for dynamics and control conference (pp. 110–123).
Sootla, A., Cowen-Rivers, A. I., Jafferjee, T., Wang, Z., Mguni, D. H., Wang, J., & Ammar, H. (2022). Sauté RL: Almost surely safe reinforcement learning using state augmentation. In Proceedings of the 39th international conference on machine learning (pp. 20423–20443).
Sun, P., Zhou, W., & Li, H. (2020). Attentive experience replay. In Proceedings of the 34th AAAI conference on artificial intelligence (pp. 5900–5907).
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. London: MIT Press.
Todorov, E., Erez, T., & Tassa, Y. (2012). Mujoco: A physics engine for model-based control. In Proceedings of the 24th international conference on intelligent robots and systems (pp. 5026–5033).
van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double q-learning. In Proceedings of the 30th AAAI conference on artificial intelligence (pp. 2094–2100).
Wang, C., Wu, Y., Vuong, Q., & Ross, K. (2020a). Striving for simplicity and performance in off-policy DRL: Output normalization and non-uniform sampling. In Proceedings of the 37th international conference on machine learning (pp. 10070–10080).
Wang, J., Wang, X., Luo, X., Zhang, Z., Wang, W., & Li, Y. (2020b). Sem: Adaptive staged experience access mechanism for reinforcement learning. In Proceedings of the 32nd international conference on tools with artificial intelligence (pp. 1088–1095).
Wu, D., Dong, X., Shen, J., & Hoi, S. C. (2020). Reducing estimation bias via triplet-average deep deterministic policy gradient. IEEE Transactions on Neural Networks and Learning Systems, 31(11), 4933–4945.
Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021). Mastering Atari games with limited data. In Proceedings of the 35th conference on neural information processing systems (pp. 25476–25488).
Yu, Y. (2018). Towards sample efficient reinforcement learning. In Proceedings of the 27th international joint conference on artificial intelligence (pp. 5739–5743).
Yuan, M., Pun, M.-O., & Wang, D. (2022). Rényi state entropy maximization for exploration acceleration in reinforcement learning. Artificial Intelligence, 1(1), 1–11.
Zha, D., Lai, K.-H., Zhou, K., & Hu, X. (2019). Experience replay optimization. In Proceedings of the 28th international joint conference on artificial intelligence (pp. 4243–4249).
Zhang, L., Zhang, Z., Pan, Z., Chen, Y., Zhu, J., Wang, Z., et al. (2019). A framework of dual replay buffer: balancing forgetting and generalization in reinforcement learning. In Proceedings of the 2nd workshop on scaling up reinforcement learning (SURL) at the international joint conference on artificial intelligence (IJCAI).
Zhang, T., Rashidinejad, P., Jiao, J., Tian, Y., Gonzalez, J. E., & Russell, S. (2021). Made: Exploration via maximizing deviation from explored regions. In Proceedings of the 35th conference on neural information processing systems (pp. 9663–9680).
Acknowledgements
We would like to thank the anonymous reviewers for their very constructive comments. This work was supported by the National Key Research and Development Program of China (2020AAA0106100), the National Natural Science Foundation of China (62276160), and the Natural Science Foundation of Shanxi Province, China (202203021211294).
Funding
This work was supported by the National Key Research and Development Program of China (2020AAA0106100), the National Natural Science Foundation of China (62276160), and the Natural Science Foundation of Shanxi Province, China (202203021211294).
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Conceptualization: WW, DW, and JL; methodology: WW and DW; formal analysis and investigation: WW, DW, and LL; software and validation: DW and LL; writing—original draft: WW and DW; writing—review and editing: WW, DW, LL, and JL; propositional guidance: JL; supervision: JL; funding acquisition: JL, WW, and LL. All authors discussed the results and contributed to the final manuscript, helping with writing, reviewing and editing. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable. All the experiments in this paper are computer simulations of games and do not involve experiments on animals, plants, or human entities.
Consent to participate
Not applicable.
Consent for publication
Not applicable. The paper does not include data or images that require permissions to be published.
Additional information
Editor: Javier Garcia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Implementation details
The detailed parameter settings are listed in Table 1. The algorithms AER, SEM, ReMERT, LFIW, and RAER/RAER\(^\dagger\) (ours) all share the full set of Agent and SAC parameters.
The replay buffer size \(\left| {\mathcal {D}}_{\textrm{f}}\right|\) of LFIW determines how many experiences we treat as "on-policy". According to prior experience with LFIW, performance is relatively stable for \(\left| {\mathcal {D}}_{\textrm{f}}\right| = 1\times 10^{5}\). The hidden network sizes of \(\kappa _{\psi }\) are [128, 128], and the temperature hyperparameter \(T\) used to self-normalize the importance weights is 7.5. The normalization is:
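The normalization formula itself is not reproduced above. As a sketch only, under the assumption that LFIW-style weights are flattened by a temperature exponent and then self-normalized over the batch (the function name `self_normalize` and the batch-mean convention are our illustrative choices, not the paper's exact formula):

```python
def self_normalize(weights, T=7.5):
    """Temperature-scaled self-normalization of importance weights.

    Each raw weight w_i is raised to the power 1/T, which flattens the
    weight distribution for large T (reducing variance at the cost of
    bias), and the result is divided by the batch mean so the returned
    weights average to 1. This is one common convention; the paper's
    exact normalization is not shown here.
    """
    scaled = [w ** (1.0 / T) for w in weights]
    mean = sum(scaled) / len(scaled)
    return [w / mean for w in scaled]
```

With \(T = 1\) the raw ratios between weights are preserved exactly; as \(T\) grows, all weights are pushed toward 1 and the sampling becomes closer to uniform.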
ReMERT and RAER\(^\dagger\) use the same parameters as LFIW when calculating the likelihood-free importance weights.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wei, W., Wang, D., Li, L. et al. Re-attentive experience replay in off-policy reinforcement learning. Mach Learn 113, 2327–2349 (2024). https://doi.org/10.1007/s10994-023-06505-8