Abstract
In value-based reinforcement learning for continuous control, the action with the highest expected return (state-action value, or Q) should be selected as the decision. However, limited by the expressiveness of a deep Q function over continuous actions, researchers usually introduce an independent policy function that approximates the maximizing action of the Q function. These methods, known as actor-critic, implement value-based continuous control in an effective but compromised way.
In maximum entropy reinforcement learning, however, the policy function and the Q function are tightly coupled: each admits a closed-form expression in terms of the other. Exploiting this fact, we propose a value-based continuous control algorithm without a concrete Q function, which instead infers a temporary Q function from the policy whenever one is needed. Compared with current maximum entropy actor-critic methods, our method removes one Q network that would otherwise need training and one policy optimization step, which improves time efficiency while matching state-of-the-art data efficiency in our experiments.
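For context, the closed-form coupling referred to in the abstract is the standard soft-optimal policy identity of maximum entropy reinforcement learning (with temperature \(\alpha\)), as used in soft Q-learning and soft actor-critic; the equations below restate it to make explicit how a Q function can be recovered from the policy up to the action-independent soft value.

\begin{align}
  \pi^{*}(a \mid s) &= \exp\!\Big(\tfrac{1}{\alpha}\big(Q^{*}(s,a) - V^{*}(s)\big)\Big), \\
  V^{*}(s) &= \alpha \log \int_{\mathcal{A}} \exp\!\Big(\tfrac{1}{\alpha}\, Q^{*}(s,a')\Big)\, \mathrm{d}a', \\
  \text{hence}\quad Q^{*}(s,a) &= \alpha \log \pi^{*}(a \mid s) + V^{*}(s).
\end{align}

Since \(V^{*}(s)\) does not depend on the action, \(\alpha \log \pi^{*}(a \mid s)\) already ranks actions at a fixed state, which is what allows a temporary Q function to be inferred from the policy alone.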
References
Amos, B., Xu, L., Kolter, J.Z.: Input convex neural networks. In: International Conference on Machine Learning, PMLR, pp. 146–155 (2017)
Badia, A.P., et al.: Agent57: outperforming the Atari human benchmark. In: International Conference on Machine Learning, PMLR, pp. 507–517 (2020)
Baird, L.C.: Reinforcement learning in continuous time: advantage updating. In: Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN 1994), vol. 4, pp. 2448–2453. IEEE (1994)
Bellman, R.: The theory of dynamic programming. Technical report, RAND Corporation, Santa Monica, CA (1954)
Brockman, G., et al.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
Degrave, J., Abdolmaleki, A., Springenberg, J.T., Heess, N., Riedmiller, M.A.: Quinoa: a Q-function you infer normalized over actions. CoRR abs/1911.01831 (2019)
Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using Real NVP. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017, Conference Track Proceedings. OpenReview.net (2017)
Fox, R., Pakman, A., Tishby, N.: Taming the noise in reinforcement learning via soft updates. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 202–211 (2016)
Fujimoto, S., van Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, PMLR, pp. 1587–1596 (2018)
Gaskett, C., Wettergreen, D., Zelinsky, A.: Q-learning in continuous state and action spaces. In: Australasian Joint Conference on Artificial Intelligence, pp. 417–428. Springer (1999)
Gu, S., Lillicrap, T., Sutskever, I., Levine, S.: Continuous deep Q-learning with model-based acceleration. In: International Conference on Machine Learning, pp. 2829–2838 (2016)
Haarnoja, T., Tang, H., Abbeel, P., Levine, S.: Reinforcement learning with deep energy-based policies. In: International Conference on Machine Learning, PMLR, pp. 1352–1361 (2017)
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning, PMLR, pp. 1861–1870 (2018)
Haarnoja, T., et al.: Soft actor-critic algorithms and applications. CoRR abs/1812.05905 (2018)
Kalashnikov, D., et al.: Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp. 651–673 (2018)
Lazaric, A., Restelli, M., Bonarini, A.: Reinforcement learning in continuous action spaces through sequential Monte Carlo methods. Adv. Neural Inf. Process. Syst. 20, 833–840 (2007)
Lim, S.: Actor-Expert: a framework for using Q-learning in continuous action spaces. Thesis, University of Alberta (2019)
Liu, Q., Wang, D.: Stein variational gradient descent: a general purpose Bayesian inference algorithm. In: Advances in Neural Information Processing Systems, pp. 2378–2386 (2016)
Millán, J.D.R., Posenato, D., Dedieu, E.: Continuous-action Q-learning. Mach. Learn. 49(2–3), 247–265 (2002)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Nair, A.V., Pong, V., Dalal, M., Bahl, S., Lin, S., Levine, S.: Visual reinforcement learning with imagined goals. In: Advances in Neural Information Processing Systems, pp. 9191–9200 (2018)
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York (2014)
Quillen, D., Jang, E., Nachum, O., Finn, C., Ibarz, J., Levine, S.: Deep reinforcement learning for vision-based robotic grasping: a simulated comparative evaluation of off-policy methods. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6284–6291. IEEE (2018)
Ryu, M., Chow, Y., Anderson, R., Tjandraatmadja, C., Boutilier, C.: CAQL: continuous action Q-learning. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. OpenReview.net (2020)
Schrittwieser, J., et al.: Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588(7839), 604–609 (2020)
Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)
Smart, W.D., Kaelbling, L.P.: Practical reinforcement learning in continuous spaces. In: ICML, pp. 903–910. Citeseer (2000)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (2018)
Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1057–1063 (2000)
Uther, W.T., Veloso, M.M.: Tree based discretization for continuous state space reinforcement learning. In: AAAI/IAAI, pp. 769–774 (1998)
Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30 (2016)
Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
Ziebart, B.D.: Modeling purposeful adaptive behavior with the principle of maximum causal entropy (2010)
Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K.: Maximum entropy inverse reinforcement learning. In: AAAI, vol. 8, pp. 1433–1438, Chicago, IL, USA (2008)
Acknowledgement
This work was supported by the National Natural Science Foundation of China under Grant No. 61836011.
A Experiment Details
We implement our algorithm with deep neural networks as universal approximators for the policy function and the value function, and adopt a common trick from value-based methods, the target network. Unlike most continuous control methods, which use a Gaussian policy, our policy is parameterized as a Beta distribution conditioned on the state. VCWCV's hyperparameter setting mostly follows SAC's; details are given in Table 1.
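As a rough illustration only (not the authors' released code), the sketch below shows one common way to parameterize a state-conditioned Beta policy and a Polyak-averaged target-network update in PyTorch; the layer sizes, the softplus offset, and the update rate tau are illustrative assumptions rather than the paper's reported hyperparameters.

import torch
import torch.nn as nn
from torch.distributions import Beta

class BetaPolicy(nn.Module):
    """State-conditioned Beta policy: one (alpha, beta) pair per action dimension."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.alpha_head = nn.Linear(hidden, action_dim)
        self.beta_head = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.body(state)
        # softplus(.) + 1 keeps both concentration parameters above 1, giving a
        # unimodal Beta on (0, 1); samples can then be affinely rescaled to the
        # environment's bounded action range.
        alpha = nn.functional.softplus(self.alpha_head(h)) + 1.0
        beta = nn.functional.softplus(self.beta_head(h)) + 1.0
        return Beta(alpha, beta)

def polyak_update(target_net, online_net, tau=0.005):
    """Soft (Polyak-averaged) target-network update, the usual value-based trick."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), online_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

# Usage: sample an action in (0, 1) with its log-probability.
policy = BetaPolicy(state_dim=17, action_dim=6)
dist = policy(torch.randn(1, 17))
action = dist.rsample()                   # reparameterized sample
log_prob = dist.log_prob(action).sum(-1)  # joint log-prob over action dimensions

A bounded-support Beta policy avoids the out-of-range actions and squashing corrections that a Gaussian policy on a bounded action space would otherwise require.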
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhu, J., Zhang, H., Pan, Z. (2021). Value-Based Continuous Control Without Concrete State-Action Value Function. In: Tan, Y., Shi, Y. (eds) Advances in Swarm Intelligence. ICSI 2021. Lecture Notes in Computer Science, vol. 12690. Springer, Cham. https://doi.org/10.1007/978-3-030-78811-7_34
DOI: https://doi.org/10.1007/978-3-030-78811-7_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-78810-0
Online ISBN: 978-3-030-78811-7
eBook Packages: Computer Science (R0)