Abstract
Atari is a suite of video games used by reinforcement learning (RL) researchers to test the effectiveness of learning algorithms. Receiving only the raw pixels and the game score, an agent learns to develop sophisticated strategies, even reaching a level comparable to that of a professional human games tester. Ideally, such an agent should also require very few interactions with the environment. Previous competitive model-free algorithms for this task use the value-based Rainbow algorithm without any policy head. In this paper, we change this by proposing a practical discrete variant of the soft actor-critic (SAC) algorithm. The new variant enables off-policy learning with policy heads in discrete action domains. By incorporating it into the advanced Rainbow variant "bigger, better, faster" (BBF), the resulting SAC-BBF improves the previous state-of-the-art interquartile mean (IQM) from 1.045 to 1.088, and it achieves these results using only a replay ratio (RR) of 2. At this lower replay ratio, the training time of SAC-BBF is strictly one-third of the time BBF requires to reach an IQM of 1.045 at RR 8. Since an IQM greater than one indicates super-human performance, SAC-BBF is also the only model-free algorithm that reaches a super-human level using only RR 2. The code is publicly available on GitHub at https://github.com/lezhang-thu/bigger-better-faster-SAC.
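To make the discrete-SAC idea concrete, the sketch below shows how the actor loss and the soft state value can be computed exactly in a discrete action space, where the expectation over actions is a finite sum, so no reparameterization trick is needed. This is a minimal PyTorch illustration under stated assumptions, not the paper's implementation: the names `actor`, `critic`, `critic_target`, and the temperature `alpha` are hypothetical.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of discrete soft actor-critic (SAC) quantities, assuming a
# categorical policy head `actor(obs) -> logits` and critics that return one
# Q-value per action, `critic(obs) -> (batch, num_actions)`. These names and
# shapes are illustrative assumptions, not the paper's API.

def actor_loss(actor, critic, obs, alpha):
    log_pi = F.log_softmax(actor(obs), dim=-1)   # log pi(a|s), per action
    pi = log_pi.exp()
    with torch.no_grad():
        q = critic(obs)                          # Q(s, a), per action
    # E_{a~pi}[alpha * log pi(a|s) - Q(s, a)], computed as an exact sum
    # over the finite action set (no sampling or reparameterization).
    return (pi * (alpha * log_pi - q)).sum(dim=-1).mean()

def soft_state_value(actor, critic_target, next_obs, alpha):
    # V(s') = E_{a~pi}[Q_target(s', a) - alpha * log pi(a|s')], used in the
    # soft Bellman target r + gamma * (1 - done) * V(s').
    log_pi = F.log_softmax(actor(next_obs), dim=-1)
    pi = log_pi.exp()
    q = critic_target(next_obs)
    return (pi * (q - alpha * log_pi)).sum(dim=-1)
```

Computing the expectation exactly over the discrete action set is what lets the policy head be trained off-policy from replayed transitions, which is the property the abstract highlights.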
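The reported IQM is computed over human-normalized scores, so a value above one indicates super-human performance averaged over the middle half of runs. A minimal sketch of the metric, assuming a flat array of per-run human-normalized scores (roughly equivalent to `scipy.stats.trim_mean(scores, 0.25)`):

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: discard the lowest and highest 25% of the
    human-normalized scores and average the middle 50%."""
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    n = len(s)
    return s[n // 4 : n - n // 4].mean()

# Example: an IQM above 1.0 (e.g., SAC-BBF's 1.088) means the middle half
# of runs exceeds the human reference score on average.
```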
Acknowledgements
Le Zhang is supported by the Fundamental Research Funds for the Central Universities (Grant Nos. 3282023011 and 3282023053). Yong Gu is supported by the National Natural Science Foundation of China (Grant No. 62262023). Xin Zhao is supported by the National Natural Science Foundation of China (Grant No. 12201015). Yanshuo Zhang is supported by the Natural Science Foundation of Beijing (Grant No. 4232034).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, L. et al. (2025). Generalizing Soft Actor-Critic Algorithms to Discrete Action Spaces. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15031. Springer, Singapore. https://doi.org/10.1007/978-981-97-8487-5_3
DOI: https://doi.org/10.1007/978-981-97-8487-5_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8486-8
Online ISBN: 978-981-97-8487-5
eBook Packages: Computer Science, Computer Science (R0)