An offline-to-online reinforcement learning approach based on multi-action evaluation with policy extension


Abstract

Offline Reinforcement Learning (offline RL) learns from pre-collected offline data without real-time interaction with the environment by regularizing the policy through distributional or support-set constraints. However, because these constraints are overly conservative, the policy learned from offline data under support-set constraints is usually close to the behavioral policy, so offline RL struggles with active behavioral exploration. Moreover, without online interaction, policy evaluation becomes prone to inaccuracy, and the learned policy may lack robustness in the presence of sub-optimal state-action pairs or noise in the dataset. In this paper, we propose an Offline-to-Online Reinforcement Learning Approach based on Multi-action Evaluation with Policy Extension (MAERL) to improve policy exploration and the value evaluation of state-action pairs in offline RL. MAERL consists of four modules: (1) in the policy extension module, we design a policy extension method that uses the online policy to extend the offline policy; (2) in the multi-action evaluation module, we present an adaptive way to merge the offline and online policies to generate the agent's action; (3) in the action-oriented module, we learn the agent's action trajectories from the dataset, mitigating the issue of actions deviating excessively during environmental exploration; (4) to maintain consistency in the agent's actions, we propose an action temporally-aligned representation learning method that preserves the trend of the agent's actions. This ensures that the agent's actions align with the learned trajectories and prevents significant deviations during exploration. Extensive experiments are conducted on 15 scenarios of the D4RL/MuJoCo benchmark. The results demonstrate that our method achieves the best performance in 12 scenarios and the second-best performance in 3 scenarios compared with state-of-the-art methods. The project's code can be found at https://github.com/FrankGod111/Policy-Expansion.git




Data availability and access

The datasets generated or analyzed during this study are available in the D4RL repository at https://github.com/rail-berkeley/d4rl.git and https://github.com/Farama-Foundation/D4RL. The code for this work is available at the URL provided in the abstract.

References

  1. Fu J, Kumar A, Nachum O, Tucker G, Levine S (2020) D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219

  2. McDonald MJ, Hadfield-Menell D (2022) Guided imitation of task and motion planning. Proceedings of the 5th conference on robot learning. 164:630–640

  3. Chen X, Yao L, McAuley J, Zhou G, Wang X (2023) Deep reinforcement learning in recommender systems: A survey and new perspectives. Knowl-Based Syst 264:110335


  4. Sharif Razavian A, Azizpour H, Sullivan J, Carlsson S (2014) CNN features off-the-shelf: An astounding baseline for recognition. Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) workshops

  5. Gupta A, Kumar V, Lynch C, Levine S, Hausman K (2020) Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. Proceedings of the Conference on Robot Learning. 100:1025–1037


  6. Fujimoto S, Meger D, Precup D (2019) Off-policy deep reinforcement learning without exploration. Proceedings of the 36th international conference on machine learning. 97:2052–2062

  7. Schwarzer M, Rajkumar N, Noukhovitch M, Anand A, Charlin L, Hjelm RD, Bachman P, Courville AC (2021) Pretraining representations for data-efficient reinforcement learning. Adv Neural Inf Process Syst 34:12686–12699


  8. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the North American chapter of the association for computational linguistics, pp 4171–4186

  9. Campos V, Sprechmann P, Hansen S, Barreto A, Kapturowski S, Vitvitskyi A, Badia AP, Blundell C (2021) Beyond fine-tuning: Transferring behavior in reinforcement learning. Proceedings of the international conference on machine learning 2021 workshop on unsupervised reinforcement learning

  10. Nair A, Gupta A, Dalal M, Levine S (2020) Awac: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359

  11. Nakamoto M, Zhai S, Singh A, Sobol Mark M, Ma Y, Finn C, Kumar A, Levine S (2023) Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning. Adv Neural Inf Process Syst 36:62244–62269


  12. Kumar A, Fu J, Soh M, Tucker G, Levine S (2019) Stabilizing off-policy q-learning via bootstrapping error reduction. Adv Neural Inform Process Syst 32

  13. Zhang H, Xu W, Yu H (2023) Policy expansion for bridging offline-to-online reinforcement learning. Proceedings of the eleventh international conference on learning representations

  14. Seo Y, Lee K, James SL, Abbeel P (2022) Reinforcement learning with action-free pre-training from videos. Proceedings of the 39th international conference on machine learning, 162:19561–19579

  15. Son S, Zheng L, Sullivan R, Qiao Y-L, Lin M (2023) Gradient informed proximal policy optimization. Adv Neural Inf Process Syst 36:8788–8814


  16. Fujimoto S, Chang W-D, Smith E, Gu SS, Precup D, Meger D (2023) For sale: State-action representation learning for deep reinforcement learning. Adv Neural Inf Process Syst 36:61573–61624


  17. Bhatt A, Palenicek D, Belousov B, Argus M, Amiranashvili A, Brox T, Peters J (2024) Crossq: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity. Proceedings of the international conference on learning representations (ICLR)

  18. Li R, Shang Z, Zheng C, Li H, Liang Q, Cui Y (2023) Efficient distributional reinforcement learning with kullback-leibler divergence regularization. Appl Intell 53(21):24847–24863


  19. Shang Z, Li R, Zheng C, Li H, Cui Y (2023) Relative entropy regularized sample-efficient reinforcement learning with continuous actions. IEEE Trans Neural Netw Learn Syst pp 1–11

  20. Hiraoka T, Imagawa T, Hashimoto T, Onishi T, Tsuruoka Y (2022) Dropout q-functions for doubly efficient reinforcement learning. Proceedings of the tenth international conference on learning representations, ICLR 2022, Virtual Event, April 25–29 2022

  21. Zhao X, Ding S, An Y, Jia W (2019) Applications of asynchronous deep reinforcement learning based on dynamic updating weights. Appl Intell 49(2):581–591


  22. Ding S, Zhao X, Xu X, Sun T, Jia W (2019) An effective asynchronous framework for small scale reinforcement learning problems. Appl Intell 49(12):4303–4318


  23. Du X, Chen H, Wang C, Xing Y, Yang J, Yu PS, Chang Y, He L (2024) Robust multi-agent reinforcement learning via bayesian distributional value estimation. Pattern Recogn 145:109917


  24. Ciosek K, Vuong Q, Loftin R, Hofmann K (2019) Better exploration with optimistic actor critic. Adv Neural Inf Process Syst 32:103368

  25. Wu J, Wu H, Qiu Z, Wang J, Long M (2022) Supported policy optimization for offline reinforcement learning. Adv Neural Inf Process Syst 35:31278–31291


  26. Fujimoto S, Gu SS (2021) A minimalist approach to offline reinforcement learning. Adv Neural Inf Process Syst 34:20132–20145


  27. Kumar A, Zhou A, Tucker G, Levine S (2020) Conservative q-learning for offline reinforcement learning. Adv Neural Inf Process Syst 33:1179–1191


  28. Kostrikov I, Fergus R, Tompson J, Nachum O (2021) Offline reinforcement learning with fisher divergence critic regularization. Proceedings of the 38th international conference on machine learning, 139:5774–5783

  29. Kidambi R, Rajeswaran A, Netrapalli P, Joachims T (2020) Morel: model-based offline reinforcement learning. Proceedings of the 34th international conference on neural information processing systems

  30. Yu T, Thomas G, Yu L, Ermon S, Zou JY, Levine S, Finn C, Ma T (2020) Mopo: Model-based offline policy optimization. Adv Neural Inf Process Syst 33:14129–14142


  31. Liu H, Abbeel P (2021) Behavior from the void: Unsupervised active pre-training. Adv Neural Inf Process Syst 34:18459–18473


  32. Wu J, Wu H, Qiu Z, Wang J, Long M (2022) Supported policy optimization for offline reinforcement learning. Adv Neural Inf Process Syst 35:31278–31291


  33. Lee S, Seo Y, Lee K, Abbeel P, Shin J (2022) Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. Proceedings of the 5th conference on robot learning, 164:1702–1712

  34. Yang M, Nachum O (2021) Representation matters: Offline pretraining for sequential decision making. Proceedings of the 38th international conference on machine learning. 139:11784–11794

  35. Uchendu I, Xiao T, Lu Y, Zhu B, Yan M, Simon J, Bennice M, Fu C, Ma C, Jiao J, Levine S, Hausman K (2023) Jump-start reinforcement learning. Proceedings of the 40th international conference on machine learning, 202:34556–34583

  36. Kostrikov I, Nair A, Levine S (2021) Offline reinforcement learning with implicit q-learning. Advances in deep reinforcement learning workshop conference on neural information processing systems

  37. Sundhar Ramesh S, Giuseppe Sessa P, Hu Y, Krause A, Bogunovic I (2024) Distributionally robust model-based reinforcement learning with large state spaces. Proceedings of the 27th international conference on artificial intelligence and statistics, 238:100–108

  38. Guo S, Zou L, Chen H, Qu B, Chi H, Yu PS, Chang Y (2024) Sample efficient offline-to-online reinforcement learning. IEEE Trans Knowl Data Eng 36(3):1299–1310


  39. Li P, Tang H, Yang T, Hao X, Sang T, Zheng Y, Hao J, Taylor ME, Tao W, Wang Z (2022) PMIC: Improving multi-agent reinforcement learning with progressive mutual information collaboration. Proceedings of the 39th international conference on machine learning, 162:12979–12997

  40. Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th international conference on machine learning, 80:1861–1870


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 62462031, in part by the Natural Science Foundation of Jiangxi Province under Grant 20232BAB202018.

Author information


Contributions

Xuebo Cheng and Xiaohui Huang were involved in conceptualization, data curation, methodology, software, validation, writing original draft. Zhichao Huang helped in conceptualization, writing, supervision and review. Nan Jiang helped in data curation, supervision.

Corresponding author

Correspondence to Xiaohui Huang.

Ethics declarations

Ethical and informed consent for data used

Relevant ethical guidelines and regulations were followed in conducting this study.

Competing Interests

The authors declare that there are no potential competing interests or conflicting relationships in this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Algorithm

Algorithm 1 MAERL: Multi-Action Evaluation with Policy Extension

In this section, we provide the pseudo-code of MAERL in Algorithm 1. The framework is divided into two phases: offline training and online training. During the offline training phase, the model randomly samples data from the offline dataset (\(D_{offline}\)) for training, optimizing both Q-values and policies; at the same time, the action-oriented module uses random samples to learn the agent's implicit policy. In the online training phase, the policy learned in the offline phase is retained while an additional online policy is trained. The information from both policies is fused using mutual information to obtain the final action \(a_{target}\) for interacting with the environment. The model's learning performance is enhanced through policy extension and multi-action evaluation, and the action-oriented process introduces implicit action guidance, which contributes to improved exploration. The central idea of the algorithm is to train a conservative policy during offline learning and then explore during online learning, with policy extension and multi-action evaluation used to integrate the two sources of knowledge.
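The following is a minimal sketch of this two-phase loop. All component names (offline_policy, online_policy, critic, action_module, and their methods) are hypothetical placeholders rather than the authors' actual API, and the critic-weighted action fusion below is a simplification of the mutual-information-based fusion used in MAERL.

```python
# Sketch of the two-phase MAERL-style training loop; interfaces are illustrative.
import copy
import random

def offline_phase(d_offline, offline_policy, critic, action_module, steps):
    """Offline phase: optimize Q-values and the offline policy on D_offline,
    while the action-oriented module learns the implicit action trajectories."""
    for _ in range(steps):
        batch = random.sample(d_offline, k=256)   # random draws from the dataset
        critic.update(batch)                      # conservative Q-value update
        offline_policy.update(batch, critic)      # constrained policy improvement
        action_module.update(batch)               # learn implicit action guidance
    return offline_policy, critic, action_module

def online_phase(env, offline_policy, critic, action_module, steps):
    """Online phase: keep the offline policy, train an additional online policy,
    and fuse the two action proposals into a_target for interaction."""
    online_policy = copy.deepcopy(offline_policy)  # extend from the offline policy
    replay, state = [], env.reset()
    for _ in range(steps):
        a_off, a_on = offline_policy.act(state), online_policy.act(state)
        # Multi-action evaluation: pick the proposal with the higher Q-estimate
        a_target = a_off if critic.q(state, a_off) >= critic.q(state, a_on) else a_on
        next_state, reward, done, _ = env.step(a_target)
        replay.append((state, a_target, reward, next_state, done))
        critic.update(random.sample(replay, k=min(256, len(replay))))
        online_policy.update(replay, critic, guidance=action_module)
        state = env.reset() if done else next_state
    return online_policy
```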

When IQL is used for offline training followed by SAC for online training, an important question arises: can the Q-function trained offline be used directly in online SAC training? The answer is yes. According to many offline-to-online studies, such as [10, 11, 13, 31], as well as the experiments in this paper, it is beneficial to continue using the pretrained Q-function during online training rather than starting from scratch. Conflicts may nevertheless arise from the different update targets: as noted in Cal-QL [11], an excessively conservative Q-function can lead to sub-optimal actions during online training and exacerbate the performance drop. This paper further observes that the policy recovery process coincides with the Q-function returning to normal, i.e., it is a recovery from the performance drop. Mitigating this phenomenon makes the reuse of the pretrained Q-function parameters all the more valuable.
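As a concrete illustration of carrying the pretrained Q-function into online SAC fine-tuning, the sketch below loads offline critic weights before online training begins. The checkpoint path and the assumption that the offline and online critics share the same architecture are ours, for illustration only.

```python
# Sketch: reuse the offline-pretrained critic instead of re-initialising it.
import copy
import torch

def init_online_critic_from_offline(online_critic: torch.nn.Module,
                                    offline_ckpt_path: str) -> torch.nn.Module:
    """Load offline (e.g. IQL) critic weights into the online SAC critic and build
    a matching target network, so online TD updates start from the pretrained
    value estimate rather than from scratch."""
    state_dict = torch.load(offline_ckpt_path, map_location="cpu")
    online_critic.load_state_dict(state_dict)      # architectures must match
    target_critic = copy.deepcopy(online_critic)   # target starts identical
    for p in target_critic.parameters():
        p.requires_grad_(False)                    # updated only by Polyak averaging
    return target_critic
```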

Appendix B: Baseline implementations and experimental details

1.1 B.1 Parameters setting and hyper-parametric studies

The parameter settings for the comparison algorithms (including baselines) mentioned in the experimental part of this paper are shown in Tables 5, 6, 7, 8, 9, and 10 below. Specifically, for the MuJoCo tasks in SPOT's experiments, we follow previous work and omit the evaluation of the medium-expert and expert datasets, where the offline pre-trained agents already achieve expert-level performance without further fine-tuning. The experiments are repeated with 5 different random seeds; the authors provide checkpoints for the pre-trained agents in the source code, so we directly employ these checkpoints for online tuning. In our experiments, we normalized the states in the MuJoCo datasets but not in the AntMaze datasets.

Table 5 Hyper-parameters of SAC
Table 6 Hyper-parameters of CQL
Table 7 Hyper-parameters of Cal-QL

It is important to mention that, unlike SPOT, the related approach SUNG uses its VAE to estimate the density of state-action pairs rather than of the behavioral policy; we therefore set the latent dimension to 2 \(\times \) (state dim + action dim) instead of 2 \(\times \) action dim. For online tuning, we use the same experimental environment and parameter settings for all baselines. For baseline reproduction, we strictly follow the official implementations and the hyper-parameters reported in the original papers to fine-tune the TD3+BC, CQL, and SPOT agents pre-trained on the MuJoCo and AntMaze datasets. For fairness, the reintegration methods were removed from the BT (behavior transfer) and BC (behavior cloning) baselines described below.

Table 8 Hyper-parameters of IQL

1.2 B.2 Combining behavior cloning and online reinforcement learning

Behavior Cloning (BC) combines the advantages of supervised learning and reinforcement learning: a policy is first learned from expert demonstrations with supervised learning and then further improved by online fine-tuning to adapt to the environment. Given sequences of actions taken to perform a task, the BC part learns which action to take in a given state by minimizing the difference between the learned policy and the expert's; in the online part, the model interacts with the real environment and improves the policy based on actual reward signals. The goal of the decision-maker is to find a stationary policy \(\pi : S\rightarrow \Delta (A)\) that maximizes the cumulative reward:

$$\begin{aligned} V(\pi )=\mathbb {E} \left[ \sum _{t=0}^{\infty } \gamma ^{t}R(s_{t},a_{t})\mid s_{0}\sim \rho ,\,a_{t}\sim \pi (\cdot \mid s_{t}),\,\forall t\ge 0\right] . \end{aligned}$$
(B1)
Table 9 Hyper-parameters of AWAC
Table 10 Hyper-parameters of TD3+BC

Suppose there is a dataset \(D=\left\{ (s_{i},a_{i})\right\} _{i=1}^{m}\) collected by an expert policy \(\pi _{E}\), where each state-action pair is generated by the interaction of \(\pi _{E}\) with the environment. Since the expert policy is assumed to be of high quality, the goal becomes finding a policy \(\pi \) that minimizes the difference in value relative to the expert policy:

$$\begin{aligned} \min _{\pi } [V(\pi _{E})-V(\pi )]. \end{aligned}$$
(B2)

This is typically achieved by minimizing the KL divergence to the expert policy over expert-visited states:

$$\begin{aligned} \min _{\pi } \mathbb {E} _{s\sim d_{\pi _{E}}}\left[ D_{KL}\big (\pi _{E}(\cdot \mid s),\pi (\cdot \mid s)\big )\right] :=\mathbb {E}_{(s,a)\sim \rho _{\pi _{E}}}\left[ \log {\frac{\pi _{E}(a\mid s)}{\pi (a \mid s)} } \right] . \end{aligned}$$
(B3)

Here \(d_{\pi _{E}}\) and \(\rho _{\pi _{E}}\) denote the (discounted) state distribution and state-action distribution induced by the policy \(\pi _{E}\), respectively, defined in the usual way:

$$\begin{aligned} d_{\pi }(s)= \mathbb {E} \left[ \sum _{t=0}^{\infty } \gamma ^{t}\textrm{Pr}(s_{t}=s)\mid s_{0}\sim \rho ,\,a_{t}\sim \pi (\cdot \mid s_{t}),\,\forall t\ge 0 \right] , \end{aligned}$$
(B4)
$$\begin{aligned} \rho _{\pi }(s,a)= \mathbb {E} \left[ \sum _{t=0}^{\infty } \gamma ^{t}\textrm{Pr}(s_{t}=s,a_{t}=a)\mid s_{0}\sim \rho ,\,a_{t}\sim \pi (\cdot \mid s_{t}),\,\forall t\ge 0 \right] . \end{aligned}$$
(B5)
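To make the objective (B3) concrete, the following is a minimal PyTorch sketch: for a Gaussian policy, minimizing the KL divergence to the (unknown) expert policy over expert states reduces to maximizing the log-likelihood of the expert actions, since the \(\pi_E\) term in (B3) is constant in \(\pi\). Network sizes and tensor shapes are illustrative.

```python
# Minimal behavior-cloning loss corresponding to (B3).
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, states: torch.Tensor) -> torch.distributions.Normal:
        h = self.net(states)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

def bc_loss(policy: GaussianPolicy, states: torch.Tensor,
            expert_actions: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of expert actions under the current policy."""
    return -policy(states).log_prob(expert_actions).sum(dim=-1).mean()
```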

1.3 B.3 Combining Behavior Transfer and Online Reinforcement Learning

Behavior Transfer (BT) is an approach, originally aimed at machine learning methods that use unsupervised or self-supervised learning, in which pre-training is combined with reinforcement learning. The idea is to transfer learned behaviors to later learning stages by applying the policy learned in a source environment to the target environment, although fine-tuning is often required because of environmental differences. The training process is illustrated in Fig. 14. A policy-alignment or value-function-transfer approach can be used to find the method best suited to the target environment. The composition of the BT data is depicted in Fig. 13, where N denotes the total number of trajectories, \(N_{S}\) the number of trajectories from the sub-optimal policy, \(N_{E}\) the number of trajectories from the expert policy, and \(\eta \) the balancing factor:

$$\begin{aligned} N_{S}=(1-\eta )N,\quad N_{E}=\eta N. \end{aligned}$$
(B6)

This BT approach was also originally used in discrete-action settings. Its value function can be expressed as:

$$\begin{aligned} V(\pi )=\mathbb {E} \left[ \sum _{h=1}^{H} r(s_{h},a_{h})\mid P,\pi \right] , \end{aligned}$$
(B7)

where \(H\) is the maximum length of a trajectory.

Here, we apply this concept to establish a baseline within our specific context, drawing inspiration from the approach outlined by Campos et al. (2021). Following the methodology presented in their work, we utilize a Zeta distribution characterized by parameter \(a = 2\) to determine the duration of persistent unrolling steps using the offline policy. The number of persistent unroll steps is subject to resampling when not actively engaged in a persistent unroll phase, and this resampling occurs when a randomly sampled number from the interval [0, 1] is less than a predefined threshold \(\epsilon \). For our experiments, we set \(\epsilon \) to 0.1.
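The sketch below shows one way to implement this rollout schedule: the length of a persistent offline-policy unroll is drawn from a Zeta distribution with \(a = 2\), and a new unroll is triggered with probability \(\epsilon = 0.1\) when not already unrolling. This is our reading of the procedure, not the authors' exact implementation.

```python
# Persistent-unroll schedule with Zeta-distributed unroll lengths.
import numpy as np

rng = np.random.default_rng(0)
EPS, A = 0.1, 2.0        # resampling threshold and Zeta parameter from the text
remaining_unroll = 0

def choose_policy() -> str:
    """Return 'offline' while inside a persistent unroll, otherwise 'online',
    occasionally starting a new unroll of Zeta-distributed length."""
    global remaining_unroll
    if remaining_unroll > 0:
        remaining_unroll -= 1
        return "offline"
    if rng.uniform() < EPS:                 # start a new persistent unroll
        remaining_unroll = rng.zipf(A) - 1  # Zeta(a=2) sample; current step counts
        return "offline"
    return "online"
```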

This adaptation enables us to establish a robust baseline in our experimental framework, closely aligning with the principles introduced by Campos et al. in 2021. It provides a systematic approach to determining the duration of persistent unrolling steps and introduces a level of randomness to ensure adaptability during the learning process. A comparison of BC, BT, and fine-tuning experiments based on these approaches is depicted in Fig. 3. The mean normalized scores after learning the policy on two tasks with complementary datasets under different conditions are shown. A higher score indicates better performance. The experimental results indicate that (1) in the noisy expert condition, the performance of MAERL is roughly equivalent to the results of BC+Online but slightly better than the BT+Online approach, and (2) overall, the MAERL algorithm outperforms both BC+Online and BT+Online approaches in tasks under both types of conditions.

1.4 B.4 Parameter description for offline-to-online reinforcement learning

The method aims to improve offline-to-online RL performance and consists of two components:

  • training an ensemble of critic and policy networks in the offline training phase, whereas the traditional offline approach learns a policy network and a critic network;

  • balanced replay: balancing the offline-online replay scheme to make an appropriate trade-off between the use of offline and online samples.

We used the code published by the authors. The specific details are as follows:

  • Training details for offline RL. For the network architecture, we use a 2-layer multi-layer perceptron (MLP) for the value and policy networks (but in the halfcheetah-medium and halfcheetah-medium-replay scenarios we find that a 3-layer MLP is more effective for training agents). For each experimental scenario, we follow the setup of Kumar et al. while ensuring fairness.

  • Training details for online RL. The Adam optimizer is used; the policy learning rate is chosen from {3e-4, 3e-5, 3e-6} and the value learning rate is 3e-4.

  • Training details for balanced replay. For the density-ratio estimation network, we fix a 2-layer MLP in view of the actual experimental setting. We used a batch size of 256 (i.e., 256 offline samples and 256 online samples) and a learning rate of 3e-4 for all locomotion experiments; a minimal sketch of such a balanced sampler is given after this list.
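The sketch below shows one way to realise the balanced-replay trade-off described above: each update draws 256 offline and 256 online transitions, and offline samples are weighted by an estimated online/offline density ratio \(w(s,a)\) produced by the 2-layer MLP ratio network (trained separately, not shown). The buffer and network interfaces are illustrative assumptions, not the published code.

```python
# Balanced sampling of offline and online transitions with density-ratio weights.
import torch

def sample_balanced(offline_buffer, online_buffer, ratio_net, batch_size: int = 256):
    """Return a mixed batch plus per-sample weights for the critic update."""
    off = offline_buffer.sample(batch_size)   # dicts of tensors: s, a, r, s2, done
    on = online_buffer.sample(batch_size)
    with torch.no_grad():
        # density-ratio estimate w(s, a) ~ d_online / d_offline for offline samples
        w_off = ratio_net(off["s"], off["a"]).view(-1).clamp(min=1e-3)
    w_on = torch.ones(batch_size)             # online samples keep unit weight
    batch = {k: torch.cat([off[k], on[k]]) for k in off}
    weights = torch.cat([w_off, w_on])
    return batch, weights / weights.mean()    # normalise for training stability
```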

1.5 B.5 Distribution shift of offline samples and online samples

We present a comparison between the log-likelihood estimation of the distribution of offline samples during model training and the log-likelihood estimation of online samples collected by the offline RL agent. The divergence or convergence of log-likelihood estimates between offline and online samples becomes crucial. From Fig. 12, we observe that the model captures differences in the distribution of the two datasets, revealing its effectiveness in handling the transition from offline to online learning. By examining the log-likelihood estimation of data distribution in offline and online environments and its consistency with the target distribution, we gain insights into the model’s ability to ensure that the learned policy performs well not only on offline data but also generalizes effectively to dynamic and potentially different distributions encountered during online interactions.
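An illustrative version of this diagnostic is sketched below: a simple density model (a diagonal Gaussian here, as a stand-in for the model actually used) is fitted to the offline samples, and the mean log-likelihoods of offline and online samples are compared; a large gap signals distribution shift.

```python
# Compare log-likelihoods of offline vs. online samples under an offline-fitted density.
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood_shift(offline_x: np.ndarray, online_x: np.ndarray):
    """Return (mean log-lik of offline samples, mean log-lik of online samples)
    under a diagonal Gaussian fitted to the offline data."""
    mu = offline_x.mean(axis=0)
    var = offline_x.var(axis=0) + 1e-6               # regularised diagonal covariance
    density = multivariate_normal(mean=mu, cov=np.diag(var))
    return density.logpdf(offline_x).mean(), density.logpdf(online_x).mean()
```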

Fig. 12 Log-likelihood estimates of the offline samples used to train the model and of the online samples collected by the offline RL agent

Fig. 13 Conceptual representation of the composition of Behavior Transfer (BT) data

Fig. 14 Conceptual illustration of the training process for the Behavior Transfer (BT) method

1.6 B.6 Comparative analysis with CQL-fine tuning

The CQL algorithm aims to enhance sample efficiency and stability by introducing an additional constraint to limit the upper bound of Q-value estimates, thereby improving Q-learning. In other words, it corrects the estimation of Q-values in a conservative manner to enhance the robustness of training. However, when fine-tuning algorithms utilize CQL to update the Q-function with online data, erroneous peaks may occur on suboptimal actions (x-axis). A specific diagram is shown in Fig. 15. Such a process can lead to the deviation of the policy from high-reward actions covered by the dataset, favoring incorrect new actions and resulting in the degradation of the pretrained policy.

In contrast, MAERL (Multi-action Evaluation with Policy Extension, proposed in this paper) incorporates a multi-action evaluation method when estimating the Q-function. This ensures both a conservative estimate of the Q-values and coverage of all actions by the policy. During fine-tuning, it avoids missing the optimal values and, to some extent, accelerates exploration of new states.

Fig. 15 Comparison between the learned Q-value and the true value for a given state, where (a) and (b) show the fine-tuning of CQL and of MAERL, respectively. The figure visualizes a slice of the learned Q-function versus the ground-truth values for a given state

Fig. 16 Changes in average Q-values and score evaluations during offline pre-training and online fine-tuning

1.7 B.7 Average Q-value and score evaluation

In this section, we visualize the changes in average Q-values and score evaluation during the model’s offline pre-training and online fine-tuning processes, as shown in Fig. 16. The fine-tuning starts at 10K steps, with the red portion indicating the performance recovery period, coinciding with the Q-value adjustment phase.

During the offline learning phase, the Q-values produced by CQL or IQL are lower than the true values. This is attributable to mechanisms such as conservative estimation or importance weighting, which are designed to yield conservative estimates on the offline dataset. Once the agent starts interacting with online data, regions affected by extrapolation error may appear more attractive than those covered by the highly conservative offline Q-function. Extrapolation error refers to the estimation error in regions outside the known offline dataset. In other words, the Q-function may seem to perform better in certain scenarios even though it underestimates Q-values on the offline data. This can place a misleading emphasis on optimizing the policy for higher returns in regions with larger extrapolation errors, and the algorithm may then neglect the initial policy because it prioritizes behavior that appears to perform well in those specific regions.

1.8 B.8 Analysis of why exploring overvalued actions fails when fine-tuning CQL

In an experiment on CQL fine-tuning, we found that the agent selects actions with higher estimated value (Q-value) during exploration, and that performance deteriorates when MAERL is combined with CQL. We now analyze the results of this experiment. First, we review the actual implementation of CQL policy evaluation:

$$\begin{aligned} \arg \min _{Q}\; \mathbb {E} _{s\sim \mathcal {D} }\left[ \log \sum _{a} \exp \big (Q(s,a)\big )-\mathbb {E}_{a\sim \pi _{\beta }(a\mid s) }\left[ Q(s,a) \right] \right] +\mathbb {E}_{(s,a,r,s^{'})\sim \mathcal {D} } \left[ \big (Q(s,a)-y\big )^{2} \right] , \end{aligned}$$
(B8)

where \(\pi _{\beta }\) is the behavioral policy of the offline dataset and y is the standard TD target for policy evaluation. There are three terms in (B8): given a state, the first minimizes the Q-values of all actions; the second maximizes the Q-values of the actions collected in the dataset; and the third is standard TD learning. Therefore, given a state s, when we select a higher-valued action \(a_{h}\) to explore, the transition \((s,a_{h},s^{'},r)\) is collected into the replay buffer. During subsequent training, combining the first and second terms of (B8) leads to a situation in which most unseen actions receive low values (the first term makes a conservative estimate for all Q-values) while only the few seen actions receive high values (the second term pushes their values up). The third term, which involves bootstrapping, then updates high-value actions toward lower values, so the values of all actions are consistently reduced at each gradient step. This leads to a breakdown of the Q-value function, which in turn produces poor policies through propagation in policy improvement. In MAERL, this problem is addressed by a multi-action evaluation approach that takes both high-value and low-value actions into account, thereby mitigating the instability of value estimation and the preference for high Q-values during exploration.
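To make the three-term structure of (B8) concrete, the sketch below transcribes it for a discrete action space. The discrete-action setting, the tensor shapes, and the \(\alpha\) weight on the conservative term are our assumptions for illustration, not the paper's configuration.

```python
# CQL policy-evaluation loss (B8) for a discrete-action Q-network.
import torch
import torch.nn.functional as F

def cql_critic_loss(q_net, target_q_net, batch, gamma: float = 0.99, alpha: float = 1.0):
    s, a, r, s2, done = batch                  # a: LongTensor of dataset actions, shape (B,)
    q_all = q_net(s)                           # Q(s, .) for all actions, shape (B, |A|)
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) on dataset actions
    # term 1 - term 2: push down logsumexp over all actions, push up dataset actions
    conservative = torch.logsumexp(q_all, dim=1) - q_data
    # term 3: standard TD error towards y = r + gamma * max_a' Q_target(s', a')
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_q_net(s2).max(dim=1).values
    return alpha * conservative.mean() + F.mse_loss(q_data, y)
```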

Appendix C: Convergence analysis

In this section, we provide proofs of some of the formulas. Symbols that are not explicitly explained carry their usual meanings.

In the setting of offline and online interaction, both learning and training are based on Q-learning, so it suffices to show that the Q-functions of the two policies converge to a fixed point. The convergence of a Q-function can be established by checking whether the corresponding Bellman optimality operator is a \(\gamma \)-contraction.

For the Bellman optimality operator \(\mathcal {H}\), consider a specific task \(z_{j} \in Z\) with prior probability \(P_{z_{j}}\). Performing action a in state s under task \(z_{j}\) yields reward \(r(s,a\mid z_{j})\) and transition probability \(p(s' \mid s,a,z_{j})\) from state s to the next state \(s'\). The following holds:

$$\begin{aligned} \mathcal {H}Q(a\mid s,z_{j})=r(s,a\mid z_{j})+\gamma \mathbb {E}_{s^{'}}\max _{a^{'}} Q(a^{'}\mid s^{'},z_{j}), \end{aligned}$$
(C9)

where \(\mathbb {E}_{s^{'}}\) is short for \(\mathbb {E}_{s^{'}\sim p(s' | s,a,z_{j})}\).

For \(z_{0}\), the Bellman optimality operator \(\tau \) applied to a state-action pair gives:

$$\begin{aligned} \tau Q(a\mid s,z_{j})=\mathbb {E}_{z_{0}}\,r(s,a\mid z_{j}) +\gamma \,\mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}}\max _{a^{'}} Q(a^{'}\mid s^{'},z_{0}), \end{aligned}$$
(C10)

where \(\mathbb {E}_{z_{j}}\) is short for \(\mathbb {E}_{z_{j}\sim p(z_{j})}\).

1.1 C.1 Convergence analysis of q-functions computation for different policies

For any two functions \(Q_{1}(a\mid s,z_{0})\) and \(Q_{2}(a\mid s,z_{0})\), the Bellman optimality operator \(\tau \) is a \(\gamma \)-contraction. Let \(Q^{\beta }(a\mid s)\) denote the offline Q-function and \(Q^{\theta }(a\mid s)\) the online Q-function; then the following inequality holds:

$$\begin{aligned} \left\| \tau Q^{\beta }_{1}(a\mid s)-\tau Q^{\beta }_{2}(a\mid s) \right\| _{\infty }\le \gamma \left\| Q^{\beta }_{1}(a\mid s)-Q^{\beta }_{2}(a\mid s) \right\| _{\infty } , \end{aligned}$$
(C11)

where \(\left\| \cdot \right\| _{\infty } \) is the sup-norm and \(\gamma \in (0,1)\). The same holds for \(Q^{\theta }(a\mid s)\).

$$\begin{aligned} & \left\| \tau Q^{\beta }_{1}(a\mid s) - \tau Q^{\beta }_{2}(a\mid s) \right\| _{\infty }\nonumber \\ & \quad = \max _{s,a}\Bigg | \mathbb {E}_{z_{j}}r(s,a\mid z_{j}) +\gamma \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}}\max _{a^{'}} Q_{1}^{\beta }(a^{'}\mid s^{'},z_{0}) -\mathbb {E}_{z_{j}}r(s,a\mid z_{j}) -\gamma \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}}\max _{a^{'}} Q_{2}^{\beta }(a^{'}\mid s^{'},z_{0})\Bigg |\nonumber \\ & \quad = \gamma \max _{s,a}\left| \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}}\max _{a^{'}} Q_{1}^{\beta }(a^{'}\mid s^{'},z_{0}) - \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}}\max _{a^{'}} Q_{2}^{\beta }(a^{'}\mid s^{'},z_{0})\right| \nonumber \\ & \quad = \gamma \max _{s,a}\left| \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}}\left[ \max _{a^{'}} Q_{1}^{\beta }(a^{'}\mid s^{'},z_{0}) - \max _{a^{'}} Q_{2}^{\beta }(a^{'}\mid s^{'},z_{0})\right] \right| \nonumber \\ & \quad \le \gamma \max _{s,a}\, \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}} \left| \max _{a^{'}} Q_{1}^{\beta }(a^{'}\mid s^{'},z_{0}) - \max _{a^{'}} Q_{2}^{\beta }(a^{'}\mid s^{'},z_{0})\right| \nonumber \\ & \quad \le \gamma \max _{s,a}\, \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}} \left| \max _{a^{'}\in A,\,s^{'}\in S} Q_{1}^{\beta }(a^{'}\mid s^{'},z_{0}) - \max _{a^{'}\in A,\,s^{'}\in S} Q_{2}^{\beta }(a^{'}\mid s^{'},z_{0})\right| . \end{aligned}$$
(C12)

Express the above equations in normalized form:

$$\begin{aligned} & \left\| \tau Q^{\beta }_{1}(a\mid s)-\tau Q^{\beta }_{2}(a\mid s) \right\| _{\infty }\nonumber \\\le & \gamma \max _{a, s} \mathbb {E}_{z_{j}}\mathbb {E}_{s^{'}} \left| \max Q_{1}^{\beta }(a\mid s) - \max Q_{2}^{\beta }(a\mid s)\right| \nonumber \\\le & \gamma \max _{a, s} \mathbb {E}_{z_{j}}\mathbb {E} \left\| Q_{1}^{\beta }(a\mid s) - Q_{2}^{\beta }(a\mid s) \right\| _{\infty }\nonumber \\= & \gamma \left\| Q_{1}^{\beta }(a\mid s) - Q_{2}^{\beta }(a\mid s) \right\| _{\infty }. \end{aligned}$$
(C13)

Suppose that \(Q^{*} = Q^{*}(a \mid s)\) is the optimal target Q-value of the policy, so that \(Q^{*} = \tau Q^{*}\) holds. The following holds by iteration:

$$\begin{aligned} 0\le \left\| Q_{k+1}^{*} -Q_{k}^{*}\right\| _{\infty }\le \gamma \left\| Q_{k}^{*} \!-\!Q_{k-1}^{*}\right\| _{\infty }\le \cdots \le \gamma ^{k+1}\left\| Q_{1}^{*} -Q_{0}^{*}\right\| _{\infty } ; \end{aligned}$$
(C14)
$$\begin{aligned} \lim _{k \rightarrow \infty } \gamma ^{k+1}\left\| Q_{1}^{*} -Q_{0}^{*}\right\| =0, \gamma \in (0,1) . \end{aligned}$$
(C15)

Therefore, for any task, the value of each state-action pair \((s,a)\) converges stably to the optimal fixed point.

1.2 C.2 Latent policy optimization problem

The discussion in this section concerns the offline policy \(\pi _{\beta }\), the online policy \(\pi _{\theta }\), and the sampled latent policy \(\pi _{latent}\). Suppose that:

$$\begin{aligned} \arg \max _{a} Q^{*}(a\mid s)=\arg \max _{a}Q^{\pi }(a\mid s). \end{aligned}$$
(C16)

When the above equation holds, the following also holds:

$$\begin{aligned} \arg \max _{a}Q^{\pi }(a\mid s)=\arg \max _{a}Q^{\pi \rightarrow \left\langle {\pi ,Z_{o}}\right\rangle }(a\mid s). \end{aligned}$$
(C17)

The initialized latent policy may not be optimal. During the fine-tuning process, we have the opportunity to understand potential relationships within state-action pairs. Subsequently, a combination share module is employed to map potential actions to the actual action space, incorporating the learned mappings to exert a priori influence on subsequent actions. The potential action orientation, associated with each customized action, facilitates the selection of potential actions within the constrained action space, offering a certain degree of flexibility.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Cheng, X., Huang, X., Huang, Z. et al. An offline-to-online reinforcement learning approach based on multi-action evaluation with policy extension. Appl Intell 54, 12246–12271 (2024). https://doi.org/10.1007/s10489-024-05806-2
