
ORAD: a new framework of offline Reinforcement Learning with Q-value regularization

  • Special Issue
  • Published in: Evolutionary Intelligence

Abstract

Offline Reinforcement Learning (RL) defines a framework for learning from a previously collected static buffer. However, offline RL is prone to approximation errors caused by out-of-distribution (OOD) data and is particularly inefficient for pixel-based learning tasks compared with state-based control methods. Several pioneering efforts have been made to address this problem: some use pessimistic Q-value approximation for unseen observations, while others train a model to simulate the environment and learn policies from the previously collected data. However, these methods require accurate and time-consuming estimation of the Q-values or the environment models. Based on this observation, we present offline RL with augmented data (ORAD), a handy but non-trivial extension to offline RL algorithms. We show that simple data augmentations, e.g. random translation and random crop, significantly elevate the performance of state-of-the-art offline RL algorithms. In addition, we find that regularization of the Q-values can further enhance performance. Extensive experiments on the pixel-based Atari control benchmark demonstrate the superiority of ORAD over SOTA offline RL methods in terms of both performance and data efficiency, and reveal that ORAD is particularly effective for pixel-based control.
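As a rough illustration of the two ingredients named in the abstract, the sketch below shows a NumPy version of random crop/translate augmentation for stacked Atari-style frames together with a CQL-style conservative penalty as one example of Q-value regularization. This is not the authors' implementation (see the repository linked under Data Availability for that); the helper names random_crop, random_translate, and conservative_q_penalty, the batch shapes, and the choice of a CQL-style penalty are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(obs, out_size):
    """Crop a (C, H, W) observation to (C, out_size, out_size) at a random offset."""
    _, h, w = obs.shape
    top = rng.integers(0, h - out_size + 1)
    left = rng.integers(0, w - out_size + 1)
    return obs[:, top:top + out_size, left:left + out_size]

def random_translate(obs, pad=4):
    """Randomly shift a square image by zero-padding and re-cropping to its original size."""
    _, h, _ = obs.shape
    padded = np.pad(obs, ((0, 0), (pad, pad), (pad, pad)), mode="constant")
    return random_crop(padded, out_size=h)

def conservative_q_penalty(q_values, actions):
    """CQL-style regularizer for a discrete action space: push down the
    log-sum-exp of Q over all actions while pushing up the Q-values of the
    actions actually stored in the offline buffer.
    q_values: (batch, num_actions); actions: (batch,) integer indices."""
    logsumexp_q = np.log(np.exp(q_values).sum(axis=1))
    data_q = q_values[np.arange(len(actions)), actions]
    return float((logsumexp_q - data_q).mean())

# Example: augment a batch of 84x84 frame stacks and compute the penalty
# on dummy Q-values for a 6-action game.
obs_batch = rng.random((32, 4, 84, 84))                    # (batch, frames, H, W)
aug_batch = np.stack([random_translate(o) for o in obs_batch])
q_batch = rng.normal(size=(32, 6))
actions = rng.integers(0, 6, size=32)
penalty = conservative_q_penalty(q_batch, actions)
```

In this sketch the penalty would simply be added, with a weighting coefficient, to the usual TD loss of whatever offline RL algorithm is being augmented; the exact regularizer and weighting used by ORAD are specified in the full paper.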


Data Availability

We provide the data and code at https://github.com/longfeizhang617/ORAD.


Author information


Corresponding author

Correspondence to Longfei Zhang.

Ethics declarations

Conflicts of interest

This work was supported by the National Natural Science Foundation of China under Grant 71701205. The authors have no competing interests to declare that are relevant to the content of this article. Our work is original; it has not been previously published and is not under consideration for publication elsewhere.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, L., Zhang, Y., Liu, S. et al. ORAD: a new framework of offline Reinforcement Learning with Q-value regularization. Evol. Intel. 17, 339–347 (2024). https://doi.org/10.1007/s12065-022-00778-z
