
Fully distributed actor-critic architecture for multitask deep reinforcement learning

Published online by Cambridge University Press:  16 April 2021

Sergio Valcarcel Macua, Ian Davies, Aleksi Tukiainen and Enrique Munoz de Cote

Affiliation:
Secondmind, Cambridge, CB2 1LA, UK
e-mails: sergiovalmac@gmail.com, davies.ian.r@gmail.com, aleksi.tukiainen@gmail.com, enrique@people-ai.com

Abstract

We propose a fully distributed actor-critic architecture, named diffusion-distributed-actor-critic (Diff-DAC), with application to multitask reinforcement learning (MRL). During the learning process, agents communicate their value and policy parameters to their neighbours, diffusing the information across a network of agents with no need for a central station. Each agent can only access data from its local task, but aims to learn a common policy that performs well for the whole set of tasks. The architecture is scalable, since the computational and communication cost per agent depends on the number of neighbours rather than on the total number of agents. We derive Diff-DAC from duality theory and provide novel insights into the actor-critic framework, showing that it is actually an instance of the dual-ascent method. We prove almost-sure convergence of Diff-DAC to a common policy under general assumptions that hold even for deep neural network approximations. Under more restrictive assumptions, we also prove that this common policy is a stationary point of an approximation of the original problem. Numerical results on multitask extensions of common continuous-control benchmarks demonstrate that Diff-DAC stabilises learning and has a regularising effect that induces higher performance and better generalisation than previous architectures.
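To make the diffusion mechanism described above concrete, the following is a minimal sketch of an adapt-then-combine round of the kind the abstract describes: each agent first takes a local actor-critic gradient step using only its own task's data, then averages its intermediate parameters with those of its neighbours. The names (`local_gradient_step`, the combination matrix `C`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def diffusion_round(theta, w, C, local_gradient_step):
    """One adapt-then-combine round over a network of N agents.

    theta : (N, d_pi) array of per-agent policy parameters
    w     : (N, d_v)  array of per-agent critic (value) parameters
    C     : (N, N) combination matrix; C[i, j] > 0 only if agent j is a
            neighbour of agent i, and each row sums to one
    local_gradient_step : callable(i, theta_i, w_i) -> (theta_i_new, w_i_new),
            an actor-critic update that uses only agent i's local task data
    """
    n_agents = len(theta)

    # Adapt: every agent performs a local actor-critic step on its own task.
    adapted = [local_gradient_step(i, theta[i], w[i]) for i in range(n_agents)]
    theta_half = np.stack([t for t, _ in adapted])
    w_half = np.stack([v for _, v in adapted])

    # Combine: every agent averages the intermediate parameters of its
    # neighbours. Per-agent cost scales with the number of neighbours,
    # not with the total number of agents in the network.
    theta_new = C @ theta_half
    w_new = C @ w_half
    return theta_new, w_new
```

Repeating such rounds diffuses local information across the network so that, under the paper's assumptions, all agents converge to a common policy without ever routing data through a central station.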

Type
Research Article
Copyright
© The Author(s), 2021. Published by Cambridge University Press

