Impact Statement:
RL algorithms integrated with deep learning architectures (called DRL) have achieved immense success in a wide range of practical applications such as robotics, game theory, and natural language processing. Deep actor–critic (AC) is one of the most popular DRL algorithms, combining the benefits of both policy-based and value-based RL methods. However, deep AC algorithms are not free from stability issues caused by high variance, which makes them less useful in critical applications such as finance. In this work, we propose an “optimal L-step AC with general approximation architecture (optimal L-AC-GAA)” algorithm, which, while yielding the optimal policy, also attains the minimum estimator variance, a result that we establish both theoretically and experimentally. To the best of our knowledge, such a result was lacking in all prior works.
Abstract:
Reinforcement learning (RL) algorithms combined with deep learning architectures have achieved tremendous success in many practical applications. However, the policies obtained by many deep reinforcement learning (DRL) algorithms are seen to suffer from high variance, which makes them less useful in safety-critical applications. In general, it is desirable to have algorithms that give a low iterate variance while providing a high long-term reward. In this work, we consider the actor–critic (AC) paradigm, in which the critic is responsible for evaluating the policy while the feedback from the critic is used by the actor to update the policy. In the standard AC procedure, the updates of the critic and the actor are run concurrently until convergence. It has previously been observed that updating the actor once after every L > 1 steps of the critic reduces the iterate variance. In this article, we address the question of which L-value is optimal to use in these recursions and propose a data-driven ...
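To make the L-step update schedule described above concrete, the following is a minimal illustrative sketch: a tabular actor–critic on a toy two-state MDP in which the critic takes a TD(0) step on every transition while the actor takes a single policy-gradient step only once every L transitions. This is not the paper's optimal L-AC-GAA algorithm; the toy MDP, the tabular/softmax parameterization, the step sizes, and the fixed choice of L are all assumptions made purely for illustration.

```python
# Illustrative sketch only (not the paper's optimal L-AC-GAA algorithm).
# Actor is updated once after every L critic (TD) updates.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 2, 2
GAMMA = 0.95

# Assumed toy MDP: P[s, a] is the next-state distribution, R[s, a] the mean reward.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

theta = np.zeros((N_STATES, N_ACTIONS))  # actor parameters (softmax policy)
V = np.zeros(N_STATES)                   # critic estimate of the value function

def policy(s):
    """Softmax policy over actions in state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def step(s, a):
    """Sample a transition from the toy MDP."""
    s_next = rng.choice(N_STATES, p=P[s, a])
    return s_next, R[s, a]

L = 5                      # assumed value: actor updated once every L critic steps
alpha_c, alpha_a = 0.05, 0.01
s = 0
for t in range(20000):
    probs = policy(s)
    a = rng.choice(N_ACTIONS, p=probs)
    s_next, r = step(s, a)

    # Critic: one TD(0) update per environment step.
    td_error = r + GAMMA * V[s_next] - V[s]
    V[s] += alpha_c * td_error

    # Actor: a policy-gradient step only on every L-th iteration,
    # using the critic's TD error as the advantage estimate.
    if (t + 1) % L == 0:
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha_a * td_error * grad_log_pi

    s = s_next

print("learned policy:", np.round([policy(s) for s in range(N_STATES)], 3))
```

Running the sketch with different values of L shows the trade-off the abstract refers to: larger L means fewer, better-evaluated actor updates (lower iterate variance) at the cost of slower policy improvement, which is why the choice of L matters.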
Published in: IEEE Transactions on Artificial Intelligence (Volume: 5, Issue: 7, July 2024)