DCAC: Reducing Unnecessary Conservatism in Offline-to-online Reinforcement Learning

Authors:

Dongxiang Chen,

Ying Wen

DAI '23: Proceedings of the Fifth International Conference on Distributed Artificial Intelligence

Article No.: 9, Pages 1 - 12

https://doi.org/10.1145/3627676.3627677

Published: 30 December 2023

Abstract

Recent advances in offline reinforcement learning (RL) have made it possible to train powerful agents exclusively from fixed datasets. However, an agent’s performance depends critically on dataset quality, and high-quality datasets are often scarce. This scarcity makes it necessary to improve agents through subsequent interaction with the environment. In this offline-to-online setting, the state-action distribution shift can degrade well-initialized policies, which prevents off-policy RL algorithms from being applied directly to policies trained offline. Prevailing offline-to-online RL approaches are typically built on conservatism, which may inadvertently limit asymptotic performance. In response, we propose Dynamically Constrained Actor-Critic (DCAC), a method grounded in the mathematical form of dynamically constrained policy optimization. DCAC adjusts the constraints on policy optimization according to a specified rule, thereby stabilizing the initial online learning stage and reducing the undue conservatism that restricts asymptotic performance. In comprehensive experiments across diverse locomotion tasks, our method improves policies trained offline on various datasets through subsequent online interaction. The empirical results show that our method mitigates the harmful effects of distribution shift and consistently attains higher asymptotic performance than prior work.
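The idea of a constraint that is adjusted during online learning can be illustrated with a minimal sketch. This is not the DCAC algorithm: the abstract does not state the adjustment rule, so the linear schedule, the penalty-form objective, and every name below (`constraint_weight`, `constrained_actor_loss`, `w_start`, `w_end`) are assumptions chosen only to show the general shape of a constrained policy objective whose constraint is relaxed over time.

```python
# Illustrative sketch only: NOT the DCAC algorithm from the paper.
# The schedule, function names, and constants are hypothetical.

def constraint_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Hypothetical rule: linearly relax the constraint weight during online training."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return w_start + frac * (w_end - w_start)


def constrained_actor_loss(q_value, divergence, weight):
    """Generic constrained policy objective written in penalty form.

    q_value:    critic estimate for the policy's action (higher is better)
    divergence: distance between the policy and the offline behavior policy
    weight:     current strength of the constraint
    """
    return -q_value + weight * divergence


if __name__ == "__main__":
    for step in (0, 50_000, 100_000):
        w = constraint_weight(step, total_steps=100_000)
        loss = constrained_actor_loss(q_value=10.0, divergence=2.0, weight=w)
        print(step, w, loss)
```

With a large weight early on, the policy stays close to the offline behavior policy, which corresponds to the stable initial stage the abstract describes. As the weight shrinks, the objective reduces to plain value maximization, which is one way to avoid a conservative constraint capping final performance.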

Cited By

Yu Z, Kang S, Zhang X, Kiyavash N, Mooij J (2024). Offline reward perturbation boosts distributional shift in online RL. Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, 4041-4055. DOI: 10.5555/3702676.3702865. Online publication date: 15-Jul-2024
https://dl.acm.org/doi/10.5555/3702676.3702865

Index Terms

DCAC: Reducing Unnecessary Conservatism in Offline-to-online Reinforcement Learning
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Reinforcement learning
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory
      1. Reinforcement learning

Information & Contributors

Information

Published In

DAI '23: Proceedings of the Fifth International Conference on Distributed Artificial Intelligence

November 2023

139 pages

ISBN:9798400708480

DOI:10.1145/3627676

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

DAI '23: The Fifth International Conference on Distributed Artificial Intelligence

November 30 - December 3, 2023

Singapore, Singapore

Bibliometrics

Total Citations: 1
Total Downloads: 77
Downloads (Last 12 months): 63
Downloads (Last 6 weeks): 9

Reflects downloads up to 19 Feb 2025
