Abstract
Minimizing the Bellman error is, in principle, the key to value function learning, but in practice it often suffers from unstable training and slow convergence. In this paper, we study the optimization of the Bellman error distribution, with the aim of stabilizing Bellman error training. We propose a framework in which the Bellman error distribution at the current step is made to approximate the distribution at the previous step, motivated by the hypothesis that the Bellman error follows a stationary random process when training converges; enforcing this approximation stabilizes value function learning. We minimize the distance between the two distributions with Stein Variational Gradient Descent (SVGD), which also helps balance exploration and exploitation in parameter space. We then incorporate this framework into the advantage actor-critic (A2C) algorithm. Experimental results on discrete control problems show that our algorithm achieves higher average returns and smaller Bellman errors than both A2C and the anchor method, while stabilizing the training process.
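For context, the distribution-matching step above builds on the generic SVGD update of Liu and Wang (2016). The following is a minimal NumPy sketch of that generic update, not the authors' implementation: the names `rbf_kernel` and `svgd_step`, the median-heuristic bandwidth, and the step size are illustrative assumptions, and in the paper's setting the score function `grad_logp` would be derived from the previous step's Bellman error distribution.

```python
import numpy as np

def rbf_kernel(X, h=None):
    """RBF kernel matrix and the summed kernel gradients used by SVGD.

    X: (n, d) array of particles.
    Returns K with shape (n, n) and grad_K with shape (n, d), where
    grad_K[i] = sum_j d/dx_j k(x_j, x_i).
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    if h is None:
        # Median heuristic for the bandwidth (an assumed default,
        # as in Liu & Wang's reference implementation).
        h = np.median(sq_dists) / np.log(X.shape[0] + 1) + 1e-8
    K = np.exp(-sq_dists / h)
    # For k(x, y) = exp(-||x - y||^2 / h):
    #   d/dx_j k(x_j, x_i) = -(2 / h) * (x_j - x_i) * K[j, i],
    # summed over j (K is symmetric, so K @ X covers the cross term).
    grad_K = (2.0 / h) * (K.sum(axis=1, keepdims=True) * X - K @ X)
    return K, grad_K

def svgd_step(particles, grad_logp, step_size=1e-2):
    """One SVGD update. The kernel-weighted score term drives particles
    toward high density of the target distribution, while the kernel
    gradient term repels nearby particles, preserving diversity.

    particles: (n, d) array, e.g. an ensemble of critic parameters.
    grad_logp: (n, d) array of scores d/dx log p(x) at each particle.
    """
    n = particles.shape[0]
    K, grad_K = rbf_kernel(particles)
    phi = (K @ grad_logp + grad_K) / n  # Stein variational direction
    return particles + step_size * phi
```

The repulsive kernel-gradient term is what the abstract refers to as balancing exploration and exploitation in parameter space: particles are attracted to the target density but pushed apart from one another. How the target log-density is constructed from the previous step's Bellman error distribution is specific to the paper and is not reproduced in this generic sketch.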
References
Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013)
Espeholt, L., Soyer, H., Munos, R., et al.: IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561 (2018)
Hessel, M., Modayil, J., van Hasselt, H., et al.: Rainbow: combining improvements in deep reinforcement learning. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)
Liu, Q., Wang, D.: Stein variational gradient descent: a general purpose Bayesian inference algorithm. Adv. Neural Inf. Process. Syst. 29, 2378–2386 (2016)
Liu, Q., Wang, D.: Stein variational gradient descent as moment matching. Adv. Neural Inf. Process. Syst. 31, 8854–8863 (2018)
Liu, Y., et al.: Stein variational policy gradient. arXiv preprint arXiv:1704.02399 (2017)
Mnih, V., Badia, A.P., Mirza, M., et al.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937 (2016)
Mnih, V., Kavukcuoglu, K., Silver, D., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015). https://doi.org/10.1038/nature14236
Mnih, V., et al.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
Nachum, O., Gu, S.S., Lee, H., Levine, S.: Data-efficient hierarchical reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 3303–3313 (2018)
Pearce, T., Anastassacos, N., Zaki, M., Neely, A.: Bayesian inference with anchored ensembles of neural networks, and application to reinforcement learning. arXiv preprint arXiv:1805.11324 (2018)
Pearce, T., Zaki, M., Brintrup, A., Anastassacos, N., Neely, A.: Uncertainty in neural networks: Bayesian ensembling. arXiv preprint arXiv:1810.05546 (2018)
Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015)
Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.: Deterministic policy gradient algorithms. In: International Conference on Machine Learning, pp. 387–395 (2014)
Tang, J., Qu, M., Wang, M., et al.: LINE: large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077 (2015)
Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., De Freitas, N.: Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581 (2015)
Cite this paper
Gong, C., Bai, Y., Hou, X., Ji, X. (2020). Stable Training of Bellman Error in Reinforcement Learning. In: Yang, H., Pasupa, K., Leung, A.CS., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Communications in Computer and Information Science, vol 1333. Springer, Cham. https://doi.org/10.1007/978-3-030-63823-8_51