
Actor-Critic Algorithms with Online Feature Adaptation

Published: 09 February 2016

Abstract

We develop two new online actor-critic control algorithms with adaptive feature tuning for Markov Decision Processes (MDPs): one for the long-run average cost objective and the other for discounted cost MDPs. Our actor-critic architecture parameterizes both the policy and the value function. The actor improves its performance through a gradient search in the policy parameters, but computing this gradient requires an estimate of the value function of the policy corresponding to the current actor parameter. The critic supplies this estimate using linear function approximation, and the resulting approximation error can lead to suboptimal policies. We therefore also update the features themselves, performing gradient descent on the Grassmannian of feature subspaces to minimize a mean square Bellman error objective and thereby find the best features. The aim is to obtain a good approximation of the value function and thus ensure convergence of the actor to locally optimal policies. To estimate the gradient of the objective, we use the policy gradient theorem in the average cost case and the simultaneous perturbation stochastic approximation (SPSA) scheme in the discounted cost case. We prove that our actor-critic algorithms converge to locally optimal policies. Experiments in two different settings show the performance improvements obtained with our feature adaptation scheme.
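
The abstract describes an actor-critic architecture in which a critic approximates the value function linearly over a given set of features while the actor performs a gradient search in the policy parameters using the critic's estimate. The following is a minimal sketch of that structure for the average cost case, not the authors' exact algorithm; the softmax policy parameterization, the step sizes, and all names (Phi, theta, w, rho) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, n_features = 10, 3, 4
Phi = rng.standard_normal((n_states, n_features))  # feature matrix: row s = features of state s
theta = np.zeros((n_features, n_actions))          # actor (policy) parameters
w = np.zeros(n_features)                           # critic (value function) parameters
rho = 0.0                                          # running estimate of the average cost
alpha, beta, xi = 0.01, 0.1, 0.1                   # actor, critic, average-cost step sizes

def policy(s):
    """Softmax policy over actions with preferences Phi[s] @ theta[:, a]."""
    prefs = Phi[s] @ theta
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def actor_critic_step(s, a, cost, s_next):
    """One online update from the observed transition (s, a, cost, s_next)."""
    global w, theta, rho
    # Critic: average-cost TD(0) error with linear approximation V(s) ~= Phi[s] @ w.
    delta = cost - rho + Phi[s_next] @ w - Phi[s] @ w
    rho += xi * (cost - rho)
    w += beta * delta * Phi[s]
    # Actor: stochastic gradient step along -delta * grad log pi(a | s),
    # descending because delta is a cost-based TD error.
    p = policy(s)
    grad_log = np.outer(Phi[s], -p)
    grad_log[:, a] += Phi[s]
    theta -= alpha * delta * grad_log

# Example use on one simulated transition.
s = 0
a = rng.choice(n_actions, p=policy(s))
actor_critic_step(s, a, cost=1.0, s_next=1)
```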
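
The feature adaptation step performs gradient descent on the Grassmannian, i.e., it updates the subspace spanned by the columns of the feature matrix so as to reduce a mean square Bellman error. Below is a hedged sketch of one such step, assuming Phi has orthonormal columns: a sampled, residual-gradient style estimate of the Bellman error gradient is projected onto the tangent space at Phi and the result is retracted to the manifold via a QR factorization. The step size eta and the choice of retraction are illustrative assumptions, not the paper's exact update.

```python
import numpy as np

def grassmann_step(Phi, w, rho, s, cost, s_next, eta=0.01):
    """One projected-gradient step on the Grassmannian spanned by Phi's columns."""
    # Sampled Bellman error for the average cost setting, with V(s) ~= Phi[s] @ w.
    delta = cost - rho + Phi[s_next] @ w - Phi[s] @ w
    # Euclidean gradient of delta**2 with respect to Phi, holding w fixed
    # (residual-gradient style): only rows s and s_next are affected.
    G = np.zeros_like(Phi)
    G[s_next] += 2.0 * delta * w
    G[s] -= 2.0 * delta * w
    # Project onto the tangent space of the Grassmannian at Phi: (I - Phi Phi^T) G.
    xi = G - Phi @ (Phi.T @ G)
    # Step along the negative projected gradient, then retract with a thin QR
    # factorization so the new feature matrix again has orthonormal columns.
    Q, R = np.linalg.qr(Phi - eta * xi)
    return Q * np.sign(np.diag(R))  # fix column signs for a continuous retraction

# Example: orthonormalize a random feature matrix and take one adaptation step.
rng = np.random.default_rng(1)
Phi0, _ = np.linalg.qr(rng.standard_normal((10, 4)))
Phi1 = grassmann_step(Phi0, np.ones(4), rho=0.0, s=0, cost=1.0, s_next=1)
```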
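
For the discounted cost objective, the gradient of the performance objective is estimated with simultaneous perturbation stochastic approximation (SPSA, Spall 1992), which perturbs all parameters simultaneously with a random Rademacher vector and needs only two function evaluations per gradient estimate. The sketch below shows the standard two-sided SPSA estimator; evaluate_cost is a hypothetical black-box estimate of the discounted cost at a given parameter and is not part of the paper.

```python
import numpy as np

def spsa_gradient(theta, evaluate_cost, c=0.1, rng=None):
    """Two-measurement SPSA estimate of the gradient of evaluate_cost at theta."""
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    j_plus = evaluate_cost(theta + c * delta)
    j_minus = evaluate_cost(theta - c * delta)
    # Componentwise: (J(theta + c*delta) - J(theta - c*delta)) / (2 * c * delta_i).
    return (j_plus - j_minus) / (2.0 * c * delta)

# Example with a quadratic stand-in for the (unknown) discounted cost objective.
g = spsa_gradient(np.ones(5), lambda th: float(np.sum(th ** 2)))
```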

Supplementary Material

prabuchandran.zip: supplemental movie, appendix, image, and software files for "Actor-Critic Algorithms with Online Feature Adaptation"

Published In

ACM Transactions on Modeling and Computer Simulation, Volume 26, Issue 4
May 2016
147 pages
ISSN: 1049-3301
EISSN: 1558-1195
DOI: 10.1145/2892241
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 February 2016
Accepted: 01 December 2015
Revised: 01 December 2015
Received: 01 June 2014
Published in TOMACS Volume 26, Issue 4

Author Tags

  1. Grassmann manifold
  2. Markov decision processes
  3. SPSA
  4. actor-critic algorithms
  5. feature adaptation
  6. function approximation
  7. online learning
  8. policy gradients
  9. residual gradient scheme
  10. stochastic approximation
  11. temporal difference learning

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Department of Science and Technology
  • Robert Bosch Centre
  • Department of Science and Technology for a project titled “Distributed Computation over Large Networks and High-Dimensional Data Analysis.”
  • Xerox Corporation, USA
  • Tata Consultancy Services

Cited By

  • (2023) A versatile dynamic noise control framework based on computer simulation and modeling. Nonlinear Engineering 12:1. DOI: 10.1515/nleng-2022-0272. Online publication date: 7-Jun-2023
  • (2022) Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm. 2022 International Joint Conference on Neural Networks (IJCNN), 1-10. DOI: 10.1109/IJCNN55064.2022.9892303. Online publication date: 18-Jul-2022
  • (2021) Decentralized Deterministic Multi-Agent Reinforcement Learning. 2021 60th IEEE Conference on Decision and Control (CDC), 1548-1553. DOI: 10.1109/CDC45484.2021.9683356. Online publication date: 14-Dec-2021
  • (2021) Full Gradient DQN Reinforcement Learning: A Provably Convergent Scheme. Modern Trends in Controlled Stochastic Processes, 192-220. DOI: 10.1007/978-3-030-76928-4_10. Online publication date: 5-Jun-2021
  • (2020) Feature selection in deterministic policy gradient. The Journal of Engineering 2020:13, 403-406. DOI: 10.1049/joe.2019.1193. Online publication date: 27-Jul-2020
