
A data-based online reinforcement learning algorithm satisfying probably approximately correct principle

  • Original Article
  • Neural Computing and Applications

Abstract

This paper proposes, for the first time, a probably approximately correct (PAC) algorithm that uses online data directly and efficiently to solve the optimal control problem of continuous deterministic systems without knowledge of the system parameters. The dependence on specific approximation structures is a key factor limiting the wide application of online reinforcement learning (RL) algorithms. We remove this limitation by using the online data directly with the kd-tree technique. Moreover, we design the algorithm under the PAC principle. Complete theoretical proofs are presented, and three examples are simulated to verify its good performance. We conclude that the proposed RL algorithm reaches a near-optimal control policy within a specified maximum running time using only online data.



Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants No. 61273136 and No. 61034002, and by the Beijing Natural Science Foundation under Grant No. 4122083.

Author information


Corresponding author

Correspondence to Dongbin Zhao.

Appendix

To calculate \({\hat{Q}_t}\) by value iteration, we first initialize a function \(\hat{Q}_t^{(0)}\), which can be assigned any value; usually, \(\hat{Q}_t^{(0)}\) is set to 0 or \({V_{\max }}\). Then, the Q values of the stored samples \((\hat{s},\hat{a},\hat{r},\hat{s}^\prime ) \in {D_t}\) are calculated by

$$\begin{aligned} \hat{q}_t^{(0)}(\hat{s},\hat{a}) = \hat{r} + \gamma \mathop {\max }\limits _{a^\prime } \hat{Q}_t^{(0)}(\hat{s}^\prime ,a^\prime ). \end{aligned}$$

Furthermore, a new \(\hat{Q}_t^{(1)}\) can be obtained by

$$\begin{aligned} \hat{Q}_t^{(1)}(s,a) = \left\{ { \begin{array}{ll} {\mathop {\min }\limits _{{{\hat{s}}_i} \in {C_t}(s,a)} \hat{q}_t^{(0)}({{\hat{s}}_i},a)}, & {{\text {if}}\,\, {C_t}(s,a) \ne \varnothing }\\ {{V_{\max }}}, & {{\text {otherwise}}} \end{array}} \right. \end{aligned}$$

The above equation is exactly the process of calculating \(\hat{Q}_t^{(1)}\) from \(\hat{Q}_t^{(0)}\) by (5). This calculation is then iterated.

In general, given \(\hat{Q}_t^{(j)}\) at the j-th iteration, the Q values of the stored samples are calculated by

$$\begin{aligned} \hat{q}_t^{(j)}(\hat{s},\hat{a}) = \hat{r} + \gamma \mathop {\max }\limits _{a^\prime } \hat{Q}_t^{(j)}(\hat{s}^\prime ,a^\prime ). \end{aligned}$$

Then, \(\hat{Q}_t^{(j + 1)}\) at the \((j+1)\)-th iteration is obtained by

$$\begin{aligned} \hat{Q}_t^{(j + 1)}(s,a) = \left\{ { \begin{array}{ll} {\mathop {\min }\limits _{{{\hat{s}}_i} \in {C_t}(s,a)} \hat{q}_t^{(j)}({{\hat{s}}_i},a)}, & {{\text {if}}\,\, {C_t}(s,a) \ne \varnothing }\\ {{V_{\max }}}, & {\text{otherwise}} \end{array}} \right. \end{aligned}$$

Since the above process is a variant of solving (5) by value iteration, it is convergent, and the result is the same as directly calculating \({\hat{Q}_t}\) by value iteration. Moreover, the process only requires storing the Q values of the samples, and the values of \({\hat{Q}_t}\) over the whole state space are then easy to obtain.
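
For concreteness, the following is a minimal Python sketch of this sample-based value iteration under stated assumptions: the action set is assumed finite, \({C_t}(s,a)\) is approximated, purely for illustration, by a kd-tree radius query over the stored states recorded with action a (the paper's precise definitions of \({C_t}\) and \({V_{\max }}\) are given in the main text), and the names "samples", "actions", "radius", and "neighbors" are illustrative rather than taken from the paper.

import numpy as np
from scipy.spatial import cKDTree

def sample_based_value_iteration(samples, actions, gamma, v_max,
                                 radius=0.1, n_iter=200):
    """samples: list of (s, a, r, s_next) with s, s_next as 1-D arrays.
    Sketch of the Appendix procedure; C_t(s, a) is approximated by a
    kd-tree radius query over stored states sharing action a (assumption)."""
    # Group stored samples by action and build one kd-tree per action.
    by_action = {a: [i for i, (_, ai, _, _) in enumerate(samples) if ai == a]
                 for a in actions}
    trees = {a: cKDTree(np.array([samples[i][0] for i in idx]))
             for a, idx in by_action.items() if idx}

    def neighbors(s, a):
        # Stand-in for C_t(s, a): stored samples with action a near state s.
        if a not in trees:
            return []
        local = trees[a].query_ball_point(s, r=radius)
        return [by_action[a][j] for j in local]

    def q_hat(s, a, q_samples):
        # \hat{Q}_t^{(j)}(s, a): min over C_t(s, a) if nonempty, else V_max.
        idx = neighbors(s, a)
        return min(q_samples[i] for i in idx) if idx else v_max

    # Initializing the sample Q values to V_max corresponds to
    # \hat{Q}_t^{(0)} = V_max over the whole state-action space.
    q_samples = np.full(len(samples), float(v_max))
    for _ in range(n_iter):
        # q^{(j)}(s_hat, a_hat) = r_hat + gamma * max_a' \hat{Q}_t^{(j)}(s_hat', a')
        new_q = np.array([
            r + gamma * max(q_hat(s_next, a, q_samples) for a in actions)
            for (_, _, r, s_next) in samples
        ])
        converged = np.max(np.abs(new_q - q_samples)) < 1e-6
        q_samples = new_q
        if converged:
            break

    # Only the sample Q values are stored; \hat{Q}_t(s, a) at any state-action
    # pair is recovered on demand by q_hat(s, a, q_samples).
    return q_samples, q_hat

As the closing comment indicates, the sketch mirrors the point above: only the Q values of the stored samples are kept, and the value of \({\hat{Q}_t}\) at an arbitrary state-action pair is recovered on demand from those samples.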

About this article


Cite this article

Zhu, Y., Zhao, D. A data-based online reinforcement learning algorithm satisfying probably approximately correct principle. Neural Comput & Applic 26, 775–787 (2015). https://doi.org/10.1007/s00521-014-1738-2

