Value-Based Continuous Control Without Concrete State-Action Value Function

  • Conference paper
Advances in Swarm Intelligence (ICSI 2021)

Abstract

In value-based reinforcement learning for continuous control, it is apparent that actions with a higher expected return (state-action value, also written as Q) should be selected as the action decision. However, limited by the expressiveness of the deep Q function, researchers mostly introduce an independent policy function to approximate the preferences of the Q function. These methods, known as actor-critic, implement value-based continuous control in an effective but compromised way.

However, in Maximum Entropy Reinforcement Learning the policy function and the Q function are so tightly coupled that each has a closed-form solution in terms of the other. Exploiting this fact, we propose a value-based continuous control algorithm without a concrete Q function, which infers a temporary Q function from the policy whenever one is needed. Compared to the current maximum entropy actor-critic method, our method saves a Q network that would need training and a policy optimization step, which yields an improvement in time efficiency while retaining state-of-the-art data efficiency in experiments.
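For reference, the sketch below spells out the standard maximum entropy RL identity that such a closed form typically rests on; the temperature α, the soft state value V, and the notation used here are assumptions for illustration, not necessarily the exact formulation of the paper.

```latex
% Standard maximum entropy RL identities (assumed form; the paper's notation may differ).
% Optimal MaxEnt policy as a Boltzmann distribution over the soft Q-function:
\[
  \pi^{*}(a \mid s) = \exp\!\Big(\tfrac{1}{\alpha}\bigl(Q(s,a) - V(s)\bigr)\Big),
  \qquad
  V(s) = \alpha \log \int_{\mathcal{A}} \exp\!\bigl(Q(s,a')/\alpha\bigr)\,\mathrm{d}a' .
\]
% Inverting the first identity recovers a temporary Q-value from the policy and the state value:
\[
  Q(s,a) = \alpha \log \pi^{*}(a \mid s) + V(s).
\]
```

Under such a relation a separate Q network becomes redundant once the policy and the state value are available, which is one way to read the saving described above.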


Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grant No. 61836011.

Author information

Corresponding author

Correspondence to Haixian Zhang.

A Experiment Details

We implement our algorithm with deep neural networks as universal approximators for the policy function and the value function, and adopt a common trick from value-based methods, the target network. Unlike most continuous control methods, which use a Gaussian policy, our policy is parameterized as a Beta distribution conditioned on the state. VCWCV's hyperparameter settings are mostly taken from SAC's; more details are given in Table 1.

Table 1. Hyperparameters of VCWCV and SAC
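As a rough illustration of the policy parameterization described above, the sketch below shows a state-conditioned Beta policy head in PyTorch. The layer sizes, the softplus-plus-one parameterization of the Beta concentrations, and the affine rescaling of samples to a [-1, 1] action range are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Beta


class BetaPolicy(nn.Module):
    """State-conditioned Beta policy head (illustrative sketch, not the paper's exact network)."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Two heads output the Beta concentration parameters per action dimension.
        self.alpha_head = nn.Linear(hidden_dim, action_dim)
        self.beta_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, state: torch.Tensor) -> Beta:
        h = self.trunk(state)
        # softplus(.) + 1 keeps both concentrations above 1, so the density is unimodal.
        alpha = F.softplus(self.alpha_head(h)) + 1.0
        beta = F.softplus(self.beta_head(h)) + 1.0
        return Beta(alpha, beta)

    def sample(self, state: torch.Tensor):
        dist = self.forward(state)
        u = dist.rsample()                       # reparameterized sample in (0, 1)
        log_prob = dist.log_prob(u).sum(dim=-1)  # per-dimension log-densities summed
        action = 2.0 * u - 1.0                   # affine map to a [-1, 1] action range
        # (The constant Jacobian of the affine rescaling is omitted from log_prob.)
        return action, log_prob
```

A target network for the value function would then typically be an exponential-moving-average copy of its weights, updated in the SAC style.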

Copyright information

Ā© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhu, J., Zhang, H., Pan, Z. (2021). Value-Based Continuous Control Without Concrete State-Action Value Function. In: Tan, Y., Shi, Y. (eds) Advances in Swarm Intelligence. ICSI 2021. Lecture Notes in Computer Science, vol. 12690. Springer, Cham. https://doi.org/10.1007/978-3-030-78811-7_34

  • DOI: https://doi.org/10.1007/978-3-030-78811-7_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-78810-0

  • Online ISBN: 978-3-030-78811-7

  • eBook Packages: Computer Science, Computer Science (R0)
