Abstract
As an important approach to solving complex sequential decision problems, reinforcement learning (RL) has been widely studied in the artificial intelligence and machine learning communities. However, the generalization ability of RL remains an open problem, and it is difficult for existing RL algorithms to solve Markov decision processes (MDPs) with both continuous state and action spaces. In this paper, a novel RL approach with fast policy search and adaptive basis function selection, called Continuous-action Approximate Policy Iteration (CAPI), is proposed for RL in MDPs with both continuous state and action spaces. In CAPI, based on the value functions estimated by temporal-difference learning, a fast policy search technique is introduced to find optimal actions in continuous spaces; it is computationally efficient and easy to implement. To improve the generalization ability and learning efficiency of CAPI, two adaptive basis function selection methods are developed so that sparse approximations of value functions can be obtained efficiently, both for linear function approximators and for kernel machines. Simulation results on benchmark learning control tasks with continuous state and action spaces show that the proposed approach not only converges to a near-optimal policy within a few iterations but also achieves performance comparable to or better than Sarsa learning and previous approximate policy iteration methods such as LSPI and KLSPI.
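Although the article itself does not include source code, the two ingredients named in the abstract can be illustrated with a small sketch: kernel basis selection via an approximate-linear-dependence (ALD) style test, in the spirit of kernel-based least-squares policy iteration, and a simple coarse-to-fine search for the greedy action over a bounded continuous action interval. The function names, kernel width, thresholds, and refinement schedule below are illustrative assumptions, not the authors' CAPI implementation.

```python
# Illustrative sketch only; parameters and structure are assumptions, not the paper's method.
import numpy as np

def rbf_kernel(x, y, width=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * width ** 2)))

def ald_select_basis(samples, threshold=1e-3, width=1.0):
    """Greedy ALD-style test: keep a sample as a basis center only if it cannot be
    approximated (within `threshold`) by a linear combination of the current dictionary."""
    dictionary = []
    for z in samples:
        if not dictionary:
            dictionary.append(z)
            continue
        K = np.array([[rbf_kernel(u, v, width) for v in dictionary] for u in dictionary])
        k = np.array([rbf_kernel(u, z, width) for u in dictionary])
        c = np.linalg.solve(K + 1e-8 * np.eye(len(dictionary)), k)  # least-squares coefficients
        delta = rbf_kernel(z, z, width) - k @ c                     # ALD residual
        if delta > threshold:
            dictionary.append(z)
    return dictionary

def greedy_action(q_fn, state, a_low, a_high, levels=4, grid=11):
    """Coarse-to-fine search for argmax_a Q(state, a) over the interval [a_low, a_high]."""
    lo, hi = a_low, a_high
    best_a = 0.5 * (lo + hi)
    for _ in range(levels):
        candidates = np.linspace(lo, hi, grid)
        values = [q_fn(state, a) for a in candidates]
        best_a = float(candidates[int(np.argmax(values))])
        half = (hi - lo) / (grid - 1)                 # shrink the interval around the best action
        lo, hi = max(a_low, best_a - half), min(a_high, best_a + half)
    return best_a

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.uniform(-1.0, 1.0, size=(200, 3))   # hypothetical (state, action) feature vectors
    basis = ald_select_basis(samples, threshold=0.05, width=0.5)
    q = lambda s, a: -(a - 0.3 * s[0]) ** 2           # toy Q-function for illustration only
    a_star = greedy_action(q, state=np.array([0.5, 0.0]), a_low=-1.0, a_high=1.0)
    print(len(basis), a_star)
```

With a Q-function represented as a weighted sum of kernels centered on the selected dictionary points, `greedy_action` plays the role of the policy improvement step: for each visited state it returns an approximately greedy continuous action without enumerating a fixed discrete action set.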
References
Bach FR, Jordan MI (2002) Kernel independent component analysis. J Mach Learn Res 3:1–48
Barto AG, Sutton RS, Anderson CW (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans Syst Man Cybern 13(5):835–846
Baxter J, Bartlett PL (2001) Infinite-horizon policy-gradient estimation. J Artif Intell Res 15:319–350
Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont
Boyan J (2002) Technical update: least-squares temporal difference learning. Mach Learn 49(2–3):233–246
Crites RH, Barto AG (1998) Elevator group control using multiple reinforcement learning agents. Mach Learn 33(2–3):235–262
Dayan P (1992) The convergence of TD(λ) for general λ. Mach Learn 8:341–362
Dayan P, Sejnowski TJ (1994) TD(λ) converges with probability 1. Mach Learn 14:295–301
Engel Y, Mannor S, Meir R (2004) The kernel recursive least-squares algorithm. IEEE Trans Signal Process 52(8):2275–2285
Hasselt HV, Wiering M (2007) Reinforcement learning in continuous action spaces. In: 2007 IEEE symposium on approximate dynamic programming and reinforcement learning, pp 272–279
Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell Res 4:237–285
Lagoudakis MG, Parr R (2003) Least-squares policy iteration. J Mach Learn Res 4:1107–1149
Lazaric A, Restelli M, Bonarini A (2008) Reinforcement learning in continuous action spaces through sequential Monte Carlo methods. In: Advances in neural information processing systems. MIT Press, Cambridge
Mahadevan S, Maggioni M (2007) Proto-value functions: a Laplacian framework for learning representation and control in Markov decision processes. J Mach Learn Res 8:2169–2231
Millan JDR, Posenato D, Dedieu E (2002) Continuous-action Q-learning. Mach Learn 49(2–3):247–265
Prokhorov DV, Wunsch DC (1997) Adaptive critic designs. IEEE Trans Neural Netw 8(5):997–1007
Rasmussen CE, Kuss M (2004) Gaussian processes in reinforcement learning. In: Thrun S, Saul LK, Schölkopf B (eds) Advances in neural information processing systems, vol 16. MIT Press, Cambridge, pp 751–759
Schölkopf B, Smola A (2002) Learning with kernels. MIT Press, Cambridge
Singh SP, Jaakkola T, Littman ML, Szepesvari C (2000) Convergence results for single-step on-policy reinforcement-learning algorithms. Mach Learn 38:287–308
Sutton R (1988) Learning to predict by the method of temporal differences. Mach Learn 3(1):9–44
Sutton R (1996) Generalization in reinforcement learning: successful examples using sparse coarse coding. In: Advances in neural information processing systems, vol 8. MIT Press, Cambridge, pp 1038–1044
Sutton R, Barto A (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
Tesauro G (1994) TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Comput 6:215–219
Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Mach Learn 16:185–202
Tsitsiklis JN, Van Roy B (1997) An analysis of temporal difference learning with function approximation. IEEE Trans Autom Control 42(5):674–690
Watkins CJCH, Dayan P (1992) Q-learning. Mach Learn 8:279–292
Whiteson S, Stone P (2006) Evolutionary function approximation for reinforcement learning. J Mach Learn Res 7:877–917
Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8:229–256
Xu X, Hu DW, Lu XC (2007) Kernel-based least-squares policy iteration for reinforcement learning. IEEE Trans Neural Netw 18(4):973–997
Zhang W, Dietterich T (1995) A reinforcement learning approach to job-shop scheduling. In: Proceedings of the fourteenth international joint conference on artificial intelligence. Morgan Kaufmann, pp 1114–1120
Acknowledgments
The authors would like to thank the anonymous reviewers for their helpful comments. This research is supported by the National Natural Science Foundation of China (NSFC) under Grants 60774076 and 90820302, the Fok Ying Tung Education Foundation under Grant 114005, the National Basic Research Program of China (2007CB311001), the Ph.D. Programs Foundation of the Ministry of Education of China, and the Natural Science Foundation of Hunan Province under Grant 2007JJ3122.
Cite this article
Xu, X., Liu, C. & Hu, D. Continuous-action reinforcement learning with fast policy search and adaptive basis function selection. Soft Comput 15, 1055–1070 (2011). https://doi.org/10.1007/s00500-010-0581-3