Online model-learning algorithm from samples and trajectories

  • Original Research
  • Published in: Journal of Ambient Intelligence and Humanized Computing

Abstract

Learning the value function and the policy for continuous MDPs is non-trivial because collecting enough data is difficult. Model learning uses the collected data effectively: a model is learned from the data and then used for planning, which accelerates the learning of the value function and the policy. Most existing work on model learning focuses on improving either single-step or multi-step prediction, whereas combining the two may be a better choice. We therefore propose an online algorithm, called Online-ML-ST, in which the model is learned from both single-step samples and multi-step trajectories. Unlike existing work, the trajectories collected during interaction with the environment are used not only to learn the model offline, but also to learn the model, the value function and the policy online. Experiments on two typical continuous benchmarks, Pole Balancing and Inverted Pendulum, show that Online-ML-ST outperforms three other typical methods in learning rate and convergence rate.
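To make the idea concrete, the following is a minimal Dyna-style sketch of learning a dynamics model from both single-step samples and multi-step trajectory rollouts, and then using the model for extra planning updates of a value function. It is not the authors' Online-ML-ST implementation: the toy one-dimensional environment, the linear model, the hand-coded policy, and all update rules below are illustrative assumptions based only on the abstract.

```python
# Minimal Dyna-style sketch: model learning from single-step samples plus
# multi-step trajectories, with planning updates from the learned model.
# All components (environment, linear model, policy, updates) are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def env_step(s, a):
    """Toy 1-D dynamics: s' = s + 0.1*a + noise, reward = -s'^2 (assumed)."""
    s_next = s + 0.1 * a + 0.01 * rng.normal()
    return s_next, -s_next ** 2

class LinearModel:
    """Linear dynamics model s' ~ w0 + w1*s + w2*a, fitted by least squares."""
    def __init__(self):
        self.w = np.zeros(3)
    def fit(self, S, A, S_next):
        X = np.column_stack([np.ones_like(S), S, A])
        self.w, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    def predict(self, s, a):
        return self.w @ np.array([1.0, s, a])

samples = []        # single-step transitions (s, a, s')
trajectories = []   # full episodes, used for multi-step consistency data
model = LinearModel()

grid = np.linspace(-1.0, 1.0, 21)      # coarse grid for a tabular value function
V = np.zeros(21)
to_idx = lambda s: int(np.clip(np.searchsorted(grid, s), 0, 20))
gamma, alpha = 0.95, 0.1

for episode in range(30):
    s, traj = rng.uniform(-1, 1), []
    for t in range(20):
        a = -np.sign(s)                # crude hand-coded policy (assumption)
        s_next, r = env_step(s, a)
        samples.append((s, a, s_next))
        traj.append((s, a, s_next))
        # Direct TD(0) update from the real transition.
        V[to_idx(s)] += alpha * (r + gamma * V[to_idx(s_next)] - V[to_idx(s)])
        s = s_next
    trajectories.append(traj)

    # Refit the model online from single-step samples plus trajectory rollouts:
    # each trajectory contributes (predicted-state, action, observed-next-state)
    # pairs, so the model is also trained against its own multi-step predictions.
    S, A, SN = map(np.array, zip(*samples))
    extra = []
    for tr in trajectories:
        s_hat = tr[0][0]
        for (s_real, a, s_next) in tr:
            extra.append((s_hat, a, s_next))
            s_hat = model.predict(s_hat, a)
    ES, EA, ESN = map(np.array, zip(*extra))
    model.fit(np.concatenate([S, ES]), np.concatenate([A, EA]),
              np.concatenate([SN, ESN]))

    # Planning: extra value-function updates from model-simulated transitions.
    for _ in range(50):
        s_sim = rng.uniform(-1, 1)
        a_sim = -np.sign(s_sim)
        s_next_sim = model.predict(s_sim, a_sim)
        r_sim = -s_next_sim ** 2
        V[to_idx(s_sim)] += alpha * (r_sim + gamma * V[to_idx(s_next_sim)]
                                     - V[to_idx(s_sim)])

print("learned model weights:", model.w)
print("V(0) estimate:", V[to_idx(0.0)])
```

The design point the sketch tries to capture is the combination mentioned in the abstract: real transitions drive direct updates, trajectories supply multi-step targets for the model, and the learned model generates additional simulated experience for planning.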


Acknowledgements

This work was supported by the National Natural Science Foundation of China (61702055), the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University (93K172014K04), and the Program of the Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency.

Author information

Corresponding author

Correspondence to Shan Zhong.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhong, S., Fu, Q., Xia, K. et al. Online model-learning algorithm from samples and trajectories. J Ambient Intell Human Comput 11, 527–537 (2020). https://doi.org/10.1007/s12652-018-1133-4

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s12652-018-1133-4

Keywords
