
Actor-Critic Algorithms with Online Feature Adaptation

Published: 09 February 2016

Abstract

We develop two new online actor-critic control algorithms with adaptive feature tuning for Markov Decision Processes (MDPs): one for the long-run average cost objective and the other for discounted cost MDPs. Our actor-critic architecture parameterizes both the policy and the value function. The actor improves its performance through a gradient search in the policy parameters, but computing this gradient requires an estimate of the value function of the policy corresponding to the current actor parameter. The critic supplies this estimate using linear function approximation, and the resulting approximation error can lead to suboptimal policies. We therefore also update the features themselves, performing gradient descent on the Grassmannian of feature subspaces to minimize a mean square Bellman error objective and thereby find the best features. The aim is to obtain a good approximation of the value function and thus ensure convergence of the actor to locally optimal policies. To estimate the gradient of the objective, we use the policy gradient theorem in the average cost case and the simultaneous perturbation stochastic approximation (SPSA) scheme in the discounted cost case. We prove that our actor-critic algorithms converge to locally optimal policies. Experiments in two different settings show the performance improvements obtained with our feature adaptation scheme.
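
The abstract describes an actor-critic architecture in which a critic approximates the value function linearly over a given set of features while the actor performs a gradient search in the policy parameters using the critic's estimate. The following is a minimal sketch of that structure for the average cost case, not the authors' exact algorithm; the softmax policy parameterization, the step sizes, and all names (Phi, theta, w, rho) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, n_features = 10, 3, 4
Phi = rng.standard_normal((n_states, n_features))  # feature matrix: row s = features of state s
theta = np.zeros((n_features, n_actions))          # actor (policy) parameters
w = np.zeros(n_features)                           # critic (value function) parameters
rho = 0.0                                          # running estimate of the average cost
alpha, beta, xi = 0.01, 0.1, 0.1                   # actor, critic, average-cost step sizes

def policy(s):
    """Softmax policy over actions with preferences Phi[s] @ theta[:, a]."""
    prefs = Phi[s] @ theta
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def actor_critic_step(s, a, cost, s_next):
    """One online update from the observed transition (s, a, cost, s_next)."""
    global w, theta, rho
    # Critic: average-cost TD(0) error with linear approximation V(s) ~= Phi[s] @ w.
    delta = cost - rho + Phi[s_next] @ w - Phi[s] @ w
    rho += xi * (cost - rho)
    w += beta * delta * Phi[s]
    # Actor: stochastic gradient step along -delta * grad log pi(a | s),
    # descending because delta is a cost-based TD error.
    p = policy(s)
    grad_log = np.outer(Phi[s], -p)
    grad_log[:, a] += Phi[s]
    theta -= alpha * delta * grad_log

# Example use on one simulated transition.
s = 0
a = rng.choice(n_actions, p=policy(s))
actor_critic_step(s, a, cost=1.0, s_next=1)
```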
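
The feature adaptation step performs gradient descent on the Grassmannian, i.e., it updates the subspace spanned by the columns of the feature matrix so as to reduce a mean square Bellman error. Below is a hedged sketch of one such step, assuming Phi has orthonormal columns: a sampled, residual-gradient style estimate of the Bellman error gradient is projected onto the tangent space at Phi and the result is retracted to the manifold via a QR factorization. The step size eta and the choice of retraction are illustrative assumptions, not the paper's exact update.

```python
import numpy as np

def grassmann_step(Phi, w, rho, s, cost, s_next, eta=0.01):
    """One projected-gradient step on the Grassmannian spanned by Phi's columns."""
    # Sampled Bellman error for the average cost setting, with V(s) ~= Phi[s] @ w.
    delta = cost - rho + Phi[s_next] @ w - Phi[s] @ w
    # Euclidean gradient of delta**2 with respect to Phi, holding w fixed
    # (residual-gradient style): only rows s and s_next are affected.
    G = np.zeros_like(Phi)
    G[s_next] += 2.0 * delta * w
    G[s] -= 2.0 * delta * w
    # Project onto the tangent space of the Grassmannian at Phi: (I - Phi Phi^T) G.
    xi = G - Phi @ (Phi.T @ G)
    # Step along the negative projected gradient, then retract with a thin QR
    # factorization so the new feature matrix again has orthonormal columns.
    Q, R = np.linalg.qr(Phi - eta * xi)
    return Q * np.sign(np.diag(R))  # fix column signs for a continuous retraction

# Example: orthonormalize a random feature matrix and take one adaptation step.
rng = np.random.default_rng(1)
Phi0, _ = np.linalg.qr(rng.standard_normal((10, 4)))
Phi1 = grassmann_step(Phi0, np.ones(4), rho=0.0, s=0, cost=1.0, s_next=1)
```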
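
For the discounted cost objective, the gradient of the performance objective is estimated with simultaneous perturbation stochastic approximation (SPSA, Spall 1992), which perturbs all parameters simultaneously with a random Rademacher vector and needs only two function evaluations per gradient estimate. The sketch below shows the standard two-sided SPSA estimator; evaluate_cost is a hypothetical black-box estimate of the discounted cost at a given parameter and is not part of the paper.

```python
import numpy as np

def spsa_gradient(theta, evaluate_cost, c=0.1, rng=None):
    """Two-measurement SPSA estimate of the gradient of evaluate_cost at theta."""
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    j_plus = evaluate_cost(theta + c * delta)
    j_minus = evaluate_cost(theta - c * delta)
    # Componentwise: (J(theta + c*delta) - J(theta - c*delta)) / (2 * c * delta_i).
    return (j_plus - j_minus) / (2.0 * c * delta)

# Example with a quadratic stand-in for the (unknown) discounted cost objective.
g = spsa_gradient(np.ones(5), lambda th: float(np.sum(th ** 2)))
```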

Supplementary Material

prabuchandran.zip: supplemental movie, appendix, image, and software files for "Actor-Critic Algorithms with Online Feature Adaptation"

Published In

ACM Transactions on Modeling and Computer Simulation, Volume 26, Issue 4
May 2016
147 pages
ISSN: 1049-3301
EISSN: 1558-1195
DOI: 10.1145/2892241
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 February 2016
Accepted: 01 December 2015
Revised: 01 December 2015
Received: 01 June 2014
Published in TOMACS Volume 26, Issue 4

Author Tags

  1. Grassmann manifold
  2. Markov decision processes
  3. SPSA
  4. actor-critic algorithms
  5. feature adaptation
  6. function approximation
  7. online learning
  8. policy gradients
  9. residual gradient scheme
  10. stochastic approximation
  11. temporal difference learning

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Department of Science and Technology
  • Robert Bosch Centre
  • Department of Science and Technology for a project titled “Distributed Computation over Large Networks and High-Dimensional Data Analysis.”
  • Xerox Corporation, USA
  • Tata Consultancy Services

Cited By

  • (2023) A versatile dynamic noise control framework based on computer simulation and modeling. Nonlinear Engineering 12:1. DOI: 10.1515/nleng-2022-0272. Online publication date: 7-Jun-2023
  • (2022) Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm. 2022 International Joint Conference on Neural Networks (IJCNN), 1-10. DOI: 10.1109/IJCNN55064.2022.9892303. Online publication date: 18-Jul-2022
  • (2021) Decentralized Deterministic Multi-Agent Reinforcement Learning. 2021 60th IEEE Conference on Decision and Control (CDC), 1548-1553. DOI: 10.1109/CDC45484.2021.9683356. Online publication date: 14-Dec-2021
  • (2021) Full Gradient DQN Reinforcement Learning: A Provably Convergent Scheme. Modern Trends in Controlled Stochastic Processes, 192-220. DOI: 10.1007/978-3-030-76928-4_10. Online publication date: 5-Jun-2021
  • (2020) Feature selection in deterministic policy gradient. The Journal of Engineering 2020:13, 403-406. DOI: 10.1049/joe.2019.1193. Online publication date: 27-Jul-2020
