
Fully distributed actor-critic architecture for multitask deep reinforcement learning

Published online by Cambridge University Press:  16 April 2021

Sergio Valcarcel Macua, Ian Davies, Aleksi Tukiainen and Enrique Munoz de Cote

Affiliation:
Secondmind, Cambridge, CB2 1LA, UK
e-mails: sergiovalmac@gmail.com, davies.ian.r@gmail.com, aleksi.tukiainen@gmail.com, enrique@people-ai.com

Abstract

We propose a fully distributed actor-critic architecture, named diffusion-distributed-actor-critic (Diff-DAC), with application to multitask reinforcement learning (MRL). During the learning process, agents communicate their value and policy parameters to their neighbours, diffusing the information across a network of agents with no need for a central station. Each agent can only access data from its local task, but aims to learn a common policy that performs well for the whole set of tasks. The architecture is scalable, since the computational and communication cost per agent depends on the number of neighbours rather than on the total number of agents. We derive Diff-DAC from duality theory and provide novel insights into the actor-critic framework, showing that it is actually an instance of the dual-ascent method. We prove almost-sure convergence of Diff-DAC to a common policy under general assumptions that hold even for deep neural network approximations. Under more restrictive assumptions, we also prove that this common policy is a stationary point of an approximation of the original problem. Numerical results on multitask extensions of common continuous-control benchmarks demonstrate that Diff-DAC stabilises learning and has a regularising effect that induces higher performance and better generalisation than previous architectures.
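To make the diffusion mechanism described above concrete, the following is a minimal sketch of an adapt-then-combine round of the kind the abstract describes: each agent first takes a local actor-critic gradient step using only its own task's data, then averages its intermediate parameters with those of its neighbours. The names (`local_gradient_step`, the combination matrix `C`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def diffusion_round(theta, w, C, local_gradient_step):
    """One adapt-then-combine round over a network of N agents.

    theta : (N, d_pi) array of per-agent policy parameters
    w     : (N, d_v)  array of per-agent critic (value) parameters
    C     : (N, N) combination matrix; C[i, j] > 0 only if agent j is a
            neighbour of agent i, and each row sums to one
    local_gradient_step : callable(i, theta_i, w_i) -> (theta_i_new, w_i_new),
            an actor-critic update that uses only agent i's local task data
    """
    n_agents = len(theta)

    # Adapt: every agent performs a local actor-critic step on its own task.
    adapted = [local_gradient_step(i, theta[i], w[i]) for i in range(n_agents)]
    theta_half = np.stack([t for t, _ in adapted])
    w_half = np.stack([v for _, v in adapted])

    # Combine: every agent averages the intermediate parameters of its
    # neighbours. Per-agent cost scales with the number of neighbours,
    # not with the total number of agents in the network.
    theta_new = C @ theta_half
    w_new = C @ w_half
    return theta_new, w_new
```

Repeating such rounds diffuses local information across the network so that, under the paper's assumptions, all agents converge to a common policy without ever routing data through a central station.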

Type
Research Article
Copyright
© The Author(s), 2021. Published by Cambridge University Press

