
Improving Exploration in Actor–Critic With Weakly Pessimistic Value Estimation and Optimistic Policy Optimization



Abstract:

Deep off-policy actor–critic algorithms have been successfully applied to challenging continuous control tasks. However, these methods typically suffer from poor sample efficiency, which limits their widespread adoption in real-world domains. To mitigate this issue, we propose a novel actor–critic algorithm with weakly pessimistic value estimation and optimistic policy optimization (WPVOP) for continuous control. WPVOP integrates two key ingredients: 1) a weakly pessimistic value estimation, which compensates for the pessimism of the lower confidence bound in the conventional value function (i.e., clipped double Q-learning) to trigger exploration in low-value state-action regions, and 2) an optimistic policy optimization algorithm that samples the actions expected to benefit policy learning most toward optimal Q-values, enabling efficient exploration. We theoretically show that the proposed weakly pessimistic value estimate is bounded from below and above, and empirically show that it avoids extremely over-optimistic value estimates. The two ideas are largely complementary and can be fruitfully integrated to improve performance and promote sample-efficient exploration. We evaluate WPVOP on a suite of continuous control tasks from MuJoCo, achieving state-of-the-art sample efficiency and performance.
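
To make the value-estimation idea concrete, the sketch below shows one possible way a "weakly pessimistic" target could soften the minimum used in clipped double Q-learning. It is a minimal illustration only: the convex-combination form, the mixing coefficient beta, and the function names are assumptions for exposition and may differ from the paper's exact formulation.

    # Illustrative sketch only; beta, the mixing form, and names are assumptions.
    import torch

    def clipped_double_q_target(q1, q2):
        # Conventional pessimistic target: element-wise minimum of the two critics.
        return torch.min(q1, q2)

    def weakly_pessimistic_target(q1, q2, beta=0.75):
        # Hypothetical compensation: mix the pessimistic minimum with the
        # optimistic maximum, so that low-value state-action regions are not
        # suppressed as strongly and remain candidates for exploration.
        q_min = torch.min(q1, q2)
        q_max = torch.max(q1, q2)
        return beta * q_min + (1.0 - beta) * q_max

    if __name__ == "__main__":
        q1 = torch.tensor([1.0, -2.0, 0.5])
        q2 = torch.tensor([0.8, -1.0, 0.9])
        print("clipped double Q:", clipped_double_q_target(q1, q2))
        print("weakly pessimistic:", weakly_pessimistic_target(q1, q2))

Because the mixed target lies between the minimum and the maximum of the two critic estimates, it is bounded from below by the pessimistic target and from above by the optimistic one, which is consistent with the boundedness property stated in the abstract.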
Page(s): 8783 - 8796
Date of Publication: 28 October 2022

PubMed ID: 36306289

