Abstract
Learning the value function and the policy for continuous MDPs is non-trivial because collecting enough data is difficult. Model learning uses the collected data more effectively: a model is learned from the data and then used for planning, which accelerates the learning of the value function and the policy. Most existing work on model learning focuses on improving either single-step or multi-step prediction, whereas combining the two may be a better choice. We therefore propose an online algorithm, called Online-ML-ST, in which the model is learned from both single-step samples and multi-step trajectories. Unlike existing work, the trajectories collected through interaction with the environment are used not only to learn the model offline, but also to learn the model, the value function and the policy online. Experiments on two typical continuous benchmarks, Pole Balancing and Inverted Pendulum, show that Online-ML-ST outperforms three other typical methods in learning rate and convergence rate.
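The sketch below illustrates the core idea stated in the abstract: fitting a dynamics model online from both single-step transition samples and multi-step trajectory rollouts, so the learned model can later be used for planning. All names, the linear model form, and the update rules are illustrative assumptions for exposition only; they are not the authors' Online-ML-ST specification.

```python
# Hypothetical sketch: combine single-step and multi-step (trajectory) data
# to fit a simple linear dynamics model s' ≈ A [s; a] online.
# This is an assumption-laden illustration, not the published algorithm.
import numpy as np

class CombinedModel:
    def __init__(self, state_dim, action_dim, lr=0.01):
        self.A = np.zeros((state_dim, state_dim + action_dim))
        self.lr = lr

    def predict(self, s, a):
        return self.A @ np.concatenate([s, a])

    def update_from_sample(self, s, a, s_next):
        # single-step update: correct the one-step prediction error
        err = s_next - self.predict(s, a)
        self.A += self.lr * np.outer(err, np.concatenate([s, a]))

    def update_from_trajectory(self, traj):
        # multi-step update: roll the model forward along the stored
        # trajectory and correct the accumulated prediction error
        s_hat = traj[0][0]
        for s, a, s_next in traj:
            err = s_next - self.predict(s_hat, a)
            self.A += self.lr * np.outer(err, np.concatenate([s_hat, a]))
            s_hat = self.predict(s_hat, a)  # propagate the model's own prediction
```

In a Dyna-style loop, a model fitted this way would generate simulated transitions that supplement real experience when updating the value function and the policy.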











Acknowledgements
This work was supported by the National Natural Science Foundation of China (61702055), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Program of Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zhong, S., Fu, Q., Xia, K. et al. Online model-learning algorithm from samples and trajectories. J Ambient Intell Human Comput 11, 527–537 (2020). https://doi.org/10.1007/s12652-018-1133-4