
ORAD: a new framework of offline Reinforcement Learning with Q-value regularization

  • Special Issue
  • Published in: Evolutionary Intelligence

Abstract

Offline Reinforcement Learning (RL) defines a framework for learning from a previously collected static buffer. However, offline RL is prone to approximation errors caused by out-of-distribution (OOD) data and is particularly inefficient for pixel-based learning tasks compared with state-based control methods. Several pioneering efforts have been made to address this problem: some use pessimistic Q-value approximation for unseen observations, while others train a model to simulate the environment and learn policies from the previously collected data. However, these methods require accurate and time-consuming estimation of the Q-values or the environment models. Based on this observation, we present offline RL with augmented data (ORAD), a handy but non-trivial extension to offline RL algorithms. We show that simple data augmentations, e.g. random translation and random crop, significantly elevate the performance of state-of-the-art offline RL algorithms. In addition, we find that regularization of the Q-values can further enhance performance. Extensive experiments on the pixel-based Atari control benchmark demonstrate the superiority of ORAD over SOTA offline RL methods in terms of both performance and data efficiency, and reveal that ORAD is particularly effective for pixel-based control.
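As a rough illustration of the two ingredients named in the abstract, the sketch below shows a NumPy version of random crop/translate augmentation for stacked Atari-style frames together with a CQL-style conservative penalty as one example of Q-value regularization. This is not the authors' implementation (see the repository linked under Data Availability for that); the helper names random_crop, random_translate, and conservative_q_penalty, the batch shapes, and the choice of a CQL-style penalty are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(obs, out_size):
    """Crop a (C, H, W) observation to (C, out_size, out_size) at a random offset."""
    _, h, w = obs.shape
    top = rng.integers(0, h - out_size + 1)
    left = rng.integers(0, w - out_size + 1)
    return obs[:, top:top + out_size, left:left + out_size]

def random_translate(obs, pad=4):
    """Randomly shift a square image by zero-padding and re-cropping to its original size."""
    _, h, _ = obs.shape
    padded = np.pad(obs, ((0, 0), (pad, pad), (pad, pad)), mode="constant")
    return random_crop(padded, out_size=h)

def conservative_q_penalty(q_values, actions):
    """CQL-style regularizer for a discrete action space: push down the
    log-sum-exp of Q over all actions while pushing up the Q-values of the
    actions actually stored in the offline buffer.
    q_values: (batch, num_actions); actions: (batch,) integer indices."""
    logsumexp_q = np.log(np.exp(q_values).sum(axis=1))
    data_q = q_values[np.arange(len(actions)), actions]
    return float((logsumexp_q - data_q).mean())

# Example: augment a batch of 84x84 frame stacks and compute the penalty
# on dummy Q-values for a 6-action game.
obs_batch = rng.random((32, 4, 84, 84))                    # (batch, frames, H, W)
aug_batch = np.stack([random_translate(o) for o in obs_batch])
q_batch = rng.normal(size=(32, 6))
actions = rng.integers(0, 6, size=32)
penalty = conservative_q_penalty(q_batch, actions)
```

In this sketch the penalty would simply be added, with a weighting coefficient, to the usual TD loss of whatever offline RL algorithm is being augmented; the exact regularizer and weighting used by ORAD are specified in the full paper.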


Data Availability

We provide the data and code at https://github.com/longfeizhang617/ORAD.


Author information


Corresponding author

Correspondence to Longfei Zhang.

Ethics declarations

Conflicts of interest

This work was supported by the National Natural Science Foundation of China under Grant 71701205. The authors have no competing interests to declare that are relevant to the content of this article. Our work is original; it has not been previously published and is not under consideration for publication elsewhere.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, L., Zhang, Y., Liu, S. et al. ORAD: a new framework of offline Reinforcement Learning with Q-value regularization. Evol. Intel. 17, 339–347 (2024). https://doi.org/10.1007/s12065-022-00778-z
