Abstract:
Offline reinforcement learning (ORL) learns a policy from a static dataset without further interaction with the environment, which holds significant promise for industrial control systems characterized by inefficient online interaction and inherent safety concerns. To mitigate the extrapolation error induced by distribution shift, it is essential for ORL to constrain the learned policy to perform actions within the support set of the behavior policy. Existing methods fail to represent the behavior policy properly and typically prefer actions with higher densities within the support set, resulting in a suboptimal learned policy. This article proposes a novel ORL method that represents the behavior policy with a diffusion model and trains a reverse diffusion guide policy to instruct the pretrained diffusion model in generating actions. The diffusion model exhibits stable training and strong distribution expression ability, and the reverse diffusion guide policy can effectively explore the entire support set to help generate the optimal action. When facing low-quality datasets, a trainable perturbation can further be added to the generated action to help the learned policy escape the performance limitation of the behavior policy. Experimental results on the D4RL Gym-MuJoCo benchmark demonstrate the effectiveness of the proposed method, which surpasses several state-of-the-art ORL methods.
Published in: IEEE Transactions on Industrial Informatics (Volume: 20, Issue: 10, October 2024)
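The abstract only outlines the approach at a high level, but the basic structure it describes can be sketched: a diffusion model fit to dataset actions serves as the behavior policy, a guide network nudges each reverse diffusion step toward high-value actions within the behavior support, and an optional trainable perturbation is added to the final action for low-quality datasets. The sketch below is a hypothetical illustration, not the authors' implementation; the network architectures, the guidance rule, the number of diffusion steps, and the names NoisePredictor, GuidePolicy, and sample_action are all assumptions.

```python
# Hypothetical sketch of the described pipeline (not the paper's released code).
# Assumed: DDPM-style linear beta schedule, MLP networks, additive guidance.
import torch
import torch.nn as nn

T = 50                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class NoisePredictor(nn.Module):
    """Epsilon-network of the behavior-cloning diffusion model: predicts the
    noise added to an action, conditioned on the state and diffusion step."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )
    def forward(self, state, noisy_action, t):
        t_emb = t.float().unsqueeze(-1) / T
        return self.net(torch.cat([state, noisy_action, t_emb], dim=-1))

class GuidePolicy(nn.Module):
    """Reverse-diffusion guide: proposes a small shift of the intermediate
    sample at every reverse step so sampling drifts toward high-value regions
    of the behavior support (the paper's exact guidance rule may differ)."""
    def __init__(self, state_dim, action_dim, hidden=256, scale=0.1):
        super().__init__()
        self.scale = scale
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )
    def forward(self, state, noisy_action, t):
        t_emb = t.float().unsqueeze(-1) / T
        return self.scale * self.net(torch.cat([state, noisy_action, t_emb], dim=-1))

@torch.no_grad()
def sample_action(eps_model, guide, state, action_dim, perturb=None):
    """Run the reverse diffusion chain, nudging each step with the guide;
    optionally add a trainable perturbation to the final action."""
    a = torch.randn(state.shape[0], action_dim)
    for i in reversed(range(T)):
        t = torch.full((state.shape[0],), i, dtype=torch.long)
        eps = eps_model(state, a, t)
        mean = (a - betas[i] / torch.sqrt(1 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        a = mean + guide(state, a, t)        # steer toward high-value actions
        if i > 0:
            a = a + torch.sqrt(betas[i]) * torch.randn_like(a)
    if perturb is not None:                  # optional residual for low-quality data
        a = a + perturb(state)
    return a.clamp(-1.0, 1.0)
```

In such a setup, the epsilon-network would be pretrained on dataset (state, action) pairs and frozen, while the guide (and any perturbation network) would presumably be trained against a learned value function so that the generated actions stay on the behavior support yet favor high-return choices.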