
A data-based online reinforcement learning algorithm satisfying probably approximately correct principle

  • Original Article
  • Neural Computing and Applications

Abstract

This paper proposes, for the first time, a probably approximately correct (PAC) algorithm that uses online data directly and efficiently to solve the optimal control problem of continuous deterministic systems without knowledge of the system parameters. The dependence on specific approximation structures is a key factor limiting the wide application of online reinforcement learning (RL) algorithms. We remove this limitation by using the online data directly with the kd-tree technique. Moreover, we design the algorithm under the PAC principle. Complete theoretical proofs are presented, and three examples are simulated to verify its good performance. We conclude that the proposed RL algorithm reaches a near-optimal control policy within a specified maximum running time using only online data.



Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants No. 61273136 and No. 61034002, and by the Beijing Natural Science Foundation under Grant No. 4122083.

Author information


Corresponding author

Correspondence to Dongbin Zhao.

Appendix

To calculate \({\hat{Q}_t}\) by value iteration, we first initialize a function \(\hat{Q}_t^{(0)}\), which can be assigned any value; usually, \(\hat{Q}_t^{(0)}\) is set to 0 or \({V_{\max }}\). Then, the Q values of the stored samples \((\hat{s},\hat{a},\hat{r},\hat{s}^\prime ) \in {D_t}\) are calculated by

$$\begin{aligned} \hat{q}_t^{(0)}(\hat{s},\hat{a}) = \hat{r} + \gamma \mathop {\max }\limits _{a^\prime } \hat{Q}_t^{(0)}(\hat{s}^\prime ,a^\prime ). \end{aligned}$$

Furthermore, a new \(\hat{Q}_t^{(1)}\) can be obtained by

$$\begin{aligned} \hat{Q}_t^{(1)}(s,a) = \left\{ { \begin{array}{ll} {\mathop {\min }\limits _{{{\hat{s}}_i} \in {C_t}(s,a)} \hat{q}_t^{(0)}({{\hat{s}}_i},a)}, & {{\text {if}}\,\, {C_t}(s,a) \ne \varnothing }\\ {{V_{\max }}}, & {{\text {otherwise}}} \end{array}} \right. \end{aligned}$$

The above equation is exactly the process of calculating \(\hat{Q}_t^{(1)}\) from \(\hat{Q}_t^{(0)}\) by (5). This calculation is then iterated.

In general, given \(\hat{Q}_t^{(j)}\) at the j-th iteration, the Q values of the stored samples are calculated by

$$\begin{aligned} \hat{q}_t^{(j)}(\hat{s},\hat{a}) = \hat{r} + \gamma \mathop {\max }\limits _{a^\prime } \hat{Q}_t^{(j)}(\hat{s}^\prime ,a^\prime ). \end{aligned}$$

Then, \(\hat{Q}_t^{(j + 1)}\) at the \((j+1)\)-th iteration is obtained by

$$\begin{aligned} \hat{Q}_t^{(j + 1)}(s,a) = \left\{ { \begin{array}{ll} {\mathop {\min }\limits _{{{\hat{s}}_i} \in {C_t}(s,a)} \hat{q}_t^{(j)}({{\hat{s}}_i},a)}, & {{\text {if}}\,\, {C_t}(s,a) \ne \varnothing }\\ {{V_{\max }}}, & {\text{otherwise}} \end{array}} \right. \end{aligned}$$

Since the above process is a variant of solving (5) by value iteration, it is convergent, and the result is the same as directly calculating \({\hat{Q}_t}\) by value iteration. Moreover, the process only requires storing the Q values of the samples, and the values of \({\hat{Q}_t}\) over the whole state space are then easy to obtain.
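
For concreteness, the following is a minimal Python sketch of this sample-based value iteration under stated assumptions: the action set is assumed finite, \({C_t}(s,a)\) is approximated, purely for illustration, by a kd-tree radius query over the stored states recorded with action a (the paper's precise definitions of \({C_t}\) and \({V_{\max }}\) are given in the main text), and the names "samples", "actions", "radius", and "neighbors" are illustrative rather than taken from the paper.

import numpy as np
from scipy.spatial import cKDTree

def sample_based_value_iteration(samples, actions, gamma, v_max,
                                 radius=0.1, n_iter=200):
    """samples: list of (s, a, r, s_next) with s, s_next as 1-D arrays.
    Sketch of the Appendix procedure; C_t(s, a) is approximated by a
    kd-tree radius query over stored states sharing action a (assumption)."""
    # Group stored samples by action and build one kd-tree per action.
    by_action = {a: [i for i, (_, ai, _, _) in enumerate(samples) if ai == a]
                 for a in actions}
    trees = {a: cKDTree(np.array([samples[i][0] for i in idx]))
             for a, idx in by_action.items() if idx}

    def neighbors(s, a):
        # Stand-in for C_t(s, a): stored samples with action a near state s.
        if a not in trees:
            return []
        local = trees[a].query_ball_point(s, r=radius)
        return [by_action[a][j] for j in local]

    def q_hat(s, a, q_samples):
        # \hat{Q}_t^{(j)}(s, a): min over C_t(s, a) if nonempty, else V_max.
        idx = neighbors(s, a)
        return min(q_samples[i] for i in idx) if idx else v_max

    # Initializing the sample Q values to V_max corresponds to
    # \hat{Q}_t^{(0)} = V_max over the whole state-action space.
    q_samples = np.full(len(samples), float(v_max))
    for _ in range(n_iter):
        # q^{(j)}(s_hat, a_hat) = r_hat + gamma * max_a' \hat{Q}_t^{(j)}(s_hat', a')
        new_q = np.array([
            r + gamma * max(q_hat(s_next, a, q_samples) for a in actions)
            for (_, _, r, s_next) in samples
        ])
        converged = np.max(np.abs(new_q - q_samples)) < 1e-6
        q_samples = new_q
        if converged:
            break

    # Only the sample Q values are stored; \hat{Q}_t(s, a) at any state-action
    # pair is recovered on demand by q_hat(s, a, q_samples).
    return q_samples, q_hat

As the closing comment indicates, the sketch mirrors the point above: only the Q values of the stored samples are kept, and the value of \({\hat{Q}_t}\) at an arbitrary state-action pair is recovered on demand from those samples.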

About this article


Cite this article

Zhu, Y., Zhao, D. A data-based online reinforcement learning algorithm satisfying probably approximately correct principle. Neural Comput & Applic 26, 775–787 (2015). https://doi.org/10.1007/s00521-014-1738-2

