Dyna-like reinforcement learning based on accumulative and average rewards

Abstract:

An approach to accelerating the learning process of the actor-critic learning algorithm for reinforcement learning is presented. The algorithm is derived from principles based on the prediction of average rewards and on temporal difference (TD) learning with averaged and discounted rewards. The derived algorithm is applied to neural networks, demonstrating their effective operation in nonlinear control problems. The motivation is to show how a learning scheme implemented with artificial neural networks (ANNs) can speed up learning through an arrangement akin to Dyna-Q learning, in which a simulation model of the controlled plant is established for virtual learning between two control cycles. Instead of modeling the complicated plant, the approach introduces only a simple predictor of rewards for virtual learning in simulation mode. Two TD learning methods, based on discounted and averaged rewards respectively, are used alternately in the control and simulation modes. The proposed Alternative Learning Critic (ALC) algorithm consists of two subsystems: an Evaluation Predictor (EP), which approximates a long-term evaluation function, and an immediate action selector composed of two ANNs, an Action Controller (AC) and a Reinforcement Predictor (RP). The learning scheme is then applied to control a pendulum system tracking a desired trajectory, to demonstrate its commendable performance and robustness. Using reinforcement signals from the environment, the system applies an appropriate action to a plant with unknown dynamics so that the actual plant output tracks the desired output closely within a few learning cycles. Further, ALC is used as the compensator of a PI controller, which on its own works well only on linear systems, to control the same pendulum system.
The results show the affined system, trained ALC and the PI controller can man...
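
The alternation between control mode and simulation mode described in the abstract can be illustrated with a minimal tabular sketch. This is not the paper's ALC implementation: the function names, the dictionary-based value table, the step sizes, and the choice of which TD criterion runs in which mode are all assumptions made for illustration, and the paper itself uses neural networks rather than a table. The two TD error formulas are the standard discounted and average-reward forms.

```python
def discounted_td_error(reward, value_s, value_next, gamma=0.95):
    """Standard TD error under the discounted-return criterion."""
    return reward + gamma * value_next - value_s


def average_reward_td_error(reward, avg_reward, value_s, value_next):
    """Standard TD error under the average-reward criterion (no discounting)."""
    return reward - avg_reward + value_next - value_s


def alternate_updates(values, transitions, predicted_reward,
                      alpha=0.1, beta=0.01, gamma=0.95, avg_reward=0.0):
    """Hypothetical Dyna-like loop: one real update, then one virtual update.

    `predicted_reward` stands in for a reward predictor; no plant model is used.
    The mode-to-criterion mapping below is an assumption, not the paper's.
    """
    for s, r, s_next in transitions:
        # Control mode (assumed): real reward, discounted TD error.
        delta = discounted_td_error(r, values[s], values[s_next], gamma)
        values[s] += alpha * delta

        # Simulation mode (assumed): predicted reward, average-reward TD error.
        r_hat = predicted_reward(s)
        delta_sim = average_reward_td_error(r_hat, avg_reward,
                                            values[s], values[s_next])
        values[s] += alpha * delta_sim
        avg_reward += beta * delta_sim  # running estimate of the average reward
    return values, avg_reward
```

The point of the sketch is that the virtual update needs only a predicted reward, not a simulated next state, which is the simplification the abstract contrasts with Dyna-Q's plant model.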

Published in: 2010 IEEE International Conference on Systems, Man and Cybernetics

Date of Conference: 10-13 October 2010

Date Added to IEEE Xplore: 22 November 2010

DOI: 10.1109/ICSMC.2010.5642415

Conference Location: Istanbul, Turkey
