Abstract
In value-based reinforcement learning for continuous control, the action with the highest expected return (state-action value, or Q) should be selected as the decision. However, limited by the expressiveness of a deep Q function over continuous actions, researchers usually introduce an independent policy function that approximates the maximizing action of the Q function. These methods, known as actor-critic, implement value-based continuous control in an effective but compromised way.
In maximum entropy reinforcement learning, however, the policy function and the Q function are tightly coupled: each admits a closed-form expression in terms of the other. Exploiting this fact, we propose a value-based continuous control algorithm without a concrete Q function, which instead infers a temporary Q function from the policy whenever one is needed. Compared with current maximum entropy actor-critic methods, our method removes one Q network that would otherwise need training and one policy optimization step, which improves time efficiency while matching state-of-the-art data efficiency in our experiments.
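For context, the closed-form coupling referred to in the abstract is the standard soft-optimal policy identity of maximum entropy reinforcement learning (with temperature \(\alpha\)), as used in soft Q-learning and soft actor-critic; the equations below restate it to make explicit how a Q function can be recovered from the policy up to the action-independent soft value.

\begin{align}
  \pi^{*}(a \mid s) &= \exp\!\Big(\tfrac{1}{\alpha}\big(Q^{*}(s,a) - V^{*}(s)\big)\Big), \\
  V^{*}(s) &= \alpha \log \int_{\mathcal{A}} \exp\!\Big(\tfrac{1}{\alpha}\, Q^{*}(s,a')\Big)\, \mathrm{d}a', \\
  \text{hence}\quad Q^{*}(s,a) &= \alpha \log \pi^{*}(a \mid s) + V^{*}(s).
\end{align}

Since \(V^{*}(s)\) does not depend on the action, \(\alpha \log \pi^{*}(a \mid s)\) already ranks actions at a fixed state, which is what allows a temporary Q function to be inferred from the policy alone.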
References
Amos, B., Xu, L., Kolter, J.Z.: Input convex neural networks. In: International Conference on Machine Learning, PMLR, pp. 146–155 (2017)
Badia, A.P., et al.: Agent57: outperforming the Atari human benchmark. In: International Conference on Machine Learning, PMLR, pp. 507–517 (2020)
Baird, L.C.: Reinforcement learning in continuous time: advantage updating. In: Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN 1994), vol. 4, pp. 2448–2453. IEEE (1994)
Bellman, R.: The theory of dynamic programming. Technical report, RAND Corporation, Santa Monica, CA (1954)
Brockman, G., et al.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
Degrave, J., Abdolmaleki, A., Springenberg, J.T., Heess, N., Riedmiller, M.A.: Quinoa: a Q-function you infer normalized over actions. CoRR abs/1911.01831 (2019)
Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using Real NVP. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017, Conference Track Proceedings. OpenReview.net (2017)
Fox, R., Pakman, A., Tishby, N.: Taming the noise in reinforcement learning via soft updates. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 202–211 (2016)
Fujimoto, S., van Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, PMLR, pp. 1587–1596 (2018)
Gaskett, C., Wettergreen, D., Zelinsky, A.: Q-learning in continuous state and action spaces. In: Australasian Joint Conference on Artificial Intelligence, pp. 417–428. Springer (1999)
Gu, S., Lillicrap, T., Sutskever, I., Levine, S.: Continuous deep Q-learning with model-based acceleration. In: International Conference on Machine Learning, pp. 2829–2838 (2016)
Haarnoja, T., Tang, H., Abbeel, P., Levine, S.: Reinforcement learning with deep energy-based policies. In: International Conference on Machine Learning, PMLR, pp. 1352–1361 (2017)
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning, PMLR, pp. 1861–1870 (2018)
Haarnoja, T., et al.: Soft actor-critic algorithms and applications. CoRR abs/1812.05905 (2018)
Kalashnikov, D., et al.: Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp. 651–673 (2018)
Lazaric, A., Restelli, M., Bonarini, A.: Reinforcement learning in continuous action spaces through sequential Monte Carlo methods. Adv. Neural Inf. Process. Syst. 20, 833–840 (2007)
Lim, S.: Actor-Expert: a framework for using Q-learning in continuous action spaces. Thesis, University of Alberta (2019)
Liu, Q., Wang, D.: Stein variational gradient descent: a general purpose Bayesian inference algorithm. In: Advances in Neural Information Processing Systems, pp. 2378–2386 (2016)
Millán, J.D.R., Posenato, D., Dedieu, E.: Continuous-action Q-learning. Mach. Learn. 49(2–3), 247–265 (2002)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Nair, A.V., Pong, V., Dalal, M., Bahl, S., Lin, S., Levine, S.: Visual reinforcement learning with imagined goals. In: Advances in Neural Information Processing Systems, pp. 9191–9200 (2018)
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York (2014)
Quillen, D., Jang, E., Nachum, O., Finn, C., Ibarz, J., Levine, S.: Deep reinforcement learning for vision-based robotic grasping: a simulated comparative evaluation of off-policy methods. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6284–6291. IEEE (2018)
Ryu, M., Chow, Y., Anderson, R., Tjandraatmadja, C., Boutilier, C.: CAQL: continuous action Q-learning. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. OpenReview.net (2020)
Schrittwieser, J., et al.: Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588(7839), 604–609 (2020)
Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)
Smart, W.D., Kaelbling, L.P.: Practical reinforcement learning in continuous spaces. In: ICML, pp. 903–910. Citeseer (2000)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (2018)
Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1057–1063 (2000)
Uther, W.T., Veloso, M.M.: Tree based discretization for continuous state space reinforcement learning. In: AAAI/IAAI, pp. 769–774 (1998)
Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30 (2016)
Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
Ziebart, B.D.: Modeling purposeful adaptive behavior with the principle of maximum causal entropy (2010)
Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K.: Maximum entropy inverse reinforcement learning. In: AAAI, vol. 8, pp. 1433–1438, Chicago, IL, USA (2008)
Acknowledgement
This work was supported by the National Natural Science Foundation of China under Grant No. 61836011.
A Experiment Details
We implement our algorithm with deep neural networks as universal approximators for the policy function and the value function, and adopt a common trick from value-based methods, the target network. Unlike most continuous control methods, which use a Gaussian policy, our policy is parameterized as a Beta distribution conditioned on the state. VCWCV's hyperparameter setting mostly follows SAC's; details are given in Table 1.
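As a rough illustration only (not the authors' released code), the sketch below shows one common way to parameterize a state-conditioned Beta policy and a Polyak-averaged target-network update in PyTorch; the layer sizes, the softplus offset, and the update rate tau are illustrative assumptions rather than the paper's reported hyperparameters.

import torch
import torch.nn as nn
from torch.distributions import Beta

class BetaPolicy(nn.Module):
    """State-conditioned Beta policy: one (alpha, beta) pair per action dimension."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.alpha_head = nn.Linear(hidden, action_dim)
        self.beta_head = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.body(state)
        # softplus(.) + 1 keeps both concentration parameters above 1, giving a
        # unimodal Beta on (0, 1); samples can then be affinely rescaled to the
        # environment's bounded action range.
        alpha = nn.functional.softplus(self.alpha_head(h)) + 1.0
        beta = nn.functional.softplus(self.beta_head(h)) + 1.0
        return Beta(alpha, beta)

def polyak_update(target_net, online_net, tau=0.005):
    """Soft (Polyak-averaged) target-network update, the usual value-based trick."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), online_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

# Usage: sample an action in (0, 1) with its log-probability.
policy = BetaPolicy(state_dim=17, action_dim=6)
dist = policy(torch.randn(1, 17))
action = dist.rsample()                   # reparameterized sample
log_prob = dist.log_prob(action).sum(-1)  # joint log-prob over action dimensions

A bounded-support Beta policy avoids the out-of-range actions and squashing corrections that a Gaussian policy on a bounded action space would otherwise require.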
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhu, J., Zhang, H., Pan, Z. (2021). Value-Based Continuous Control Without Concrete State-Action Value Function. In: Tan, Y., Shi, Y. (eds) Advances in Swarm Intelligence. ICSI 2021. Lecture Notes in Computer Science, vol. 12690. Springer, Cham. https://doi.org/10.1007/978-3-030-78811-7_34
DOI: https://doi.org/10.1007/978-3-030-78811-7_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-78810-0
Online ISBN: 978-3-030-78811-7
eBook Packages: Computer Science (R0)