Abstract
Offline Reinforcement Learning (RL) defines a framework for learning from a previously collected static buffer. However, offline RL is prone to approximation errors caused by out-of-distribution (OOD) data and is particularly inefficient for pixel-based control tasks compared with state-based ones. Several pioneering efforts have been made to address this problem: some use pessimistic Q-value estimates for unseen observations, while others learn a model of the environment from the collected data and use it to train policies. However, these methods require accurate and time-consuming estimation of the Q-values or the environment model. Motivated by this observation, we present offline RL with augmented data (ORAD), a simple but non-trivial extension of offline RL algorithms. We show that simple data augmentations, e.g., random translation and random crop, significantly improve the performance of state-of-the-art offline RL algorithms. In addition, we find that regularizing the Q-values further enhances performance. Extensive experiments on pixel-based control tasks in Atari demonstrate the superiority of ORAD over state-of-the-art offline RL methods in both performance and data efficiency, and reveal that ORAD is especially effective for pixel-based control.
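As an illustration of the augmentations named above, the following is a minimal NumPy sketch of random crop and random translation applied to stacked image observations. The function names, padding, and frame sizes are illustrative assumptions, not taken from the ORAD codebase.

```python
import numpy as np

def random_crop(obs, out_size=84):
    """Randomly crop a (C, H, W) observation to (C, out_size, out_size)."""
    _, h, w = obs.shape
    top = np.random.randint(0, h - out_size + 1)
    left = np.random.randint(0, w - out_size + 1)
    return obs[:, top:top + out_size, left:left + out_size]

def random_translate(obs, pad=4):
    """Zero-pad the observation, then crop back to the original size,
    shifting the content by up to `pad` pixels in each direction."""
    c, h, w = obs.shape
    padded = np.zeros((c, h + 2 * pad, w + 2 * pad), dtype=obs.dtype)
    padded[:, pad:pad + h, pad:pad + w] = obs
    top = np.random.randint(0, 2 * pad + 1)
    left = np.random.randint(0, 2 * pad + 1)
    return padded[:, top:top + h, left:left + w]

# Example: augment a stack of 4 Atari-style frames (4, 100, 100) -> (4, 84, 84)
obs = np.random.randint(0, 256, size=(4, 100, 100), dtype=np.uint8)
cropped = random_crop(obs)           # (4, 84, 84)
shifted = random_translate(cropped)  # (4, 84, 84), randomly shifted
```

In this style of augmentation the shift and crop offsets are resampled for every observation drawn from the replay buffer, so the Q-network sees many slightly perturbed views of the same transition.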
Data Availability
We provide the data and code at https://github.com/longfeizhang617/ORAD.
Ethics declarations
Conflicts of interest
This work was supported by the National Natural Science Foundation of China under Grant 71701205. The authors have no competing interests to declare that are relevant to the content of this article. This work is original; it has not been previously published and is not under consideration for publication elsewhere.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, L., Zhang, Y., Liu, S. et al. ORAD: a new framework of offline Reinforcement Learning with Q-value regularization. Evol. Intel. 17, 339–347 (2024). https://doi.org/10.1007/s12065-022-00778-z