Abstract
Reinforcement learning (RL) algorithms have been demonstrated to solve a variety of continuous control tasks. However, the training efficiency and final performance of such methods limit their wider application. In this paper, we propose an off-policy heterogeneous actor-critic (HAC) algorithm that combines a soft Q-function with an ordinary Q-function. The soft Q-function encourages exploration by a Gaussian policy, while the ordinary Q-function optimizes the mean of the Gaussian policy to improve training efficiency. Experience replay memory is another vital component of off-policy RL methods, and we propose a new sampling technique that emphasizes recently experienced transitions to boost policy training. In addition, we integrate HAC with hindsight experience replay (HER) to handle sparse-reward tasks, which are common in the robotic manipulation domain. Finally, we evaluate our methods on a series of continuous control benchmark tasks and robotic manipulation tasks. The experimental results show that our method outperforms prior state-of-the-art methods in terms of training efficiency and performance, validating its effectiveness.
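To make the replay sampling idea concrete, the sketch below shows one simple way to bias minibatch sampling toward recently stored transitions, in the spirit of the recent-emphasizing replay described above. It is a minimal sketch only: the class name, the recency_temperature parameter, and the exponential recency weighting are illustrative assumptions, not the exact sampling rule proposed in the paper.

```python
import numpy as np

class RecentEmphasisReplay:
    """Replay buffer that samples recent transitions more often.

    Illustrative only: the exponential recency weighting and the
    `recency_temperature` knob are assumptions, not the paper's rule.
    """

    def __init__(self, capacity, recency_temperature=2.0):
        self.capacity = capacity
        self.recency_temperature = recency_temperature  # larger -> stronger bias toward new data
        self.storage = []      # transitions in insertion order
        self.next_index = 0    # circular write pointer once the buffer is full

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_index] = transition
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size):
        n = len(self.storage)
        # Age 0 is the most recently written transition.
        ages = (self.next_index - 1 - np.arange(n)) % n
        weights = np.exp(-self.recency_temperature * ages / n)
        probs = weights / weights.sum()
        indices = np.random.choice(n, size=batch_size, p=probs)
        return [self.storage[i] for i in indices]

# Example usage with dummy (state, action, reward, next_state, done) tuples.
buffer = RecentEmphasisReplay(capacity=1000)
for step in range(1200):
    buffer.add((step, 0.0, 0.0, step + 1, False))
batch = buffer.sample(batch_size=32)  # skewed toward the newest transitions
```

In the full algorithm, minibatches drawn this way would feed the updates of both the soft Q-function and the ordinary Q-function.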
Acknowledgements
This work was supported by the National Key Research and Development Program of China (No. 2018AAA0103003), the National Natural Science Foundation of China (No. 61773378), the Basic Research Program (No. JCKY *******B029), and the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB32050100).
Author information
Additional information
Recommended by Associate Editor Wing Cheong Daniel Ho
Colored figures are available in the online version at https://link.springer.com/journal/11633
Bao Xi received the B. Sc. degree in automation and the M. Eng. degree in control science and engineering from Xi'an Jiaotong University (XJTU), China in 2013 and 2016, respectively, and received the Ph. D. degree in control theory and control engineering at the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, and University of Chinese Academy of Sciences, China in 2021.
His research interests include robotics and automation.
E-mail: xi_bao@foxmail.com
ORCID iD: 0000-0003-1495-8802
Rui Wang received the B. Eng. degree in automation from Beijing Institute of Technology, China in 2013, and the Ph.D. degree in control theory and control engineering from Institute of Automation, Chinese Academy of Sciences (CASIA), China in 2018. He is currently an assistant professor with State Key Laboratory of Management and Control for Complex Systems, CASIA.
His research interests include intelligent control, robotics, underwater robots, and biomimetic robots.
E-mail: rwang5212@ia.ac.cn
ORCID iD: 0000-0003-3172-3167
Ying-Hao Cai received the Ph.D. degree in pattern recognition and intelligent systems from Institute of Automation, Chinese Academy of Sciences, China in 2009. She was a postdoctoral research associate in Institute of Robotics and Intelligent Systems, University of Southern California, USA, and a senior research scientist in Machine Vision Group, University of Oulu, Finland. She is an associate professor in Institute of Automation, Chinese Academy of Sciences, China.
Her research interests include object detection and tracking, and computer vision in robotics.
E-mail: yinghao.cai@ia.ac.cn
ORCID iD: 0000-0003-3024-2943
Tao Lu received the B. Eng. degree in control engineering from Shandong University, China in 2002, and the Ph. D. degree in control theory and control engineering from Institute of Automation, Chinese Academy of Sciences, China in 2007. He is currently an associate professor in State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, China.
His research interest is reinforcement learning in robot manipulation.
E-mail: tao.lu@ia.ac.cn
Shuo Wang received the B.Eng. degree in electrical engineering from Shenyang Architecture and Civil Engineering Institute, China in 1995, received the M.Eng. degree in industrial automation from the Northeastern University, China in 1998, and received the Ph. D. degree in control theory and control engineering from the Institute of Automation, Chinese Academy of Sciences, China in 2001. He is currently a professor in State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, the Center for Excellence in Brain Science and Intelligence Technology of Chinese Academy of Sciences, and University of Chinese Academy of Sciences, China.
His research interests include biomimetic robots, underwater robots, and multirobot systems.
E-mail: shuo.wang@ia.ac.cn (Corresponding author)
ORCID iD: 0000-0002-1390-9219
About this article
Cite this article
Xi, B., Wang, R., Cai, YH. et al. A Novel Heterogeneous Actor-critic Algorithm with Recent Emphasizing Replay Memory. Int. J. Autom. Comput. 18, 619–631 (2021). https://doi.org/10.1007/s11633-021-1296-x