Abstract
Stochastic policy gradient methods using neural representations have had considerable success in single-agent domains with continuous action spaces. These methods typically use networks that output the parameters of a diagonal Gaussian distribution from which the resulting action is sampled. In multi-agent contexts, however, better policies may require complex multimodal action distributions. Building on recent progress in density modeling, we propose conditional normalizing flows as an alternative policy representation. This approach allows for more flexible action distributions than mixture models. We demonstrate their advantage over standard methods on a set of tasks including human behavior modeling and reinforcement learning in multi-agent settings.
References
Balduzzi, D., Tuyls, K., Perolat, J., Graepel, T.: Re-evaluating evaluation. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 3268–3279 (2018)
Bansal, T., Pachocki, J., Sidor, S., Sutskever, I., Mordatch, I.: Emergent complexity via multi-agent competition. In: International Conference on Learning Representations (ICLR) (2018)
Bhattacharyya, R.P., Phillips, D.J., Liu, C., Gupta, J.K., Driggs-Campbell, K., Kochenderfer, M.J.: Simulating emergent properties of human driving behavior using multi-agent reward augmented imitation learning. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 789–795. IEEE (2019)
Blum, A., Mansour, Y.: Learning, regret minimization, and equilibria. In: Nisan, N., Roughgarden, T., Tardos, E., Vazirani, V.V. (eds.) Algorithmic Game Theory, chap. 4, pp. 79–102. Cambridge University Press (2007)
Brown, G.W.: Iterative solution of games by fictitious play. In: Activity Analysis of Production and Allocation, vol. 13, no. 1, pp. 374–376 (1951)
Busoniu, L., Babuska, R., De Schutter, B.: A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 38(2), 156–172 (2008)
Cermák, J., Bošanský, B., Durkota, K., Lisý, V., Kiekintveld, C.: Using correlated strategies for computing Stackelberg equilibria in extensive-form games. In: AAAI Conference on Artificial Intelligence (2016)
Dinh, L., Krueger, D., Bengio, Y.: NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516 (2014)
Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using Real NVP. arXiv preprint arXiv:1605.08803 (2016)
Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning (ICML), pp. 1329–1338 (2016)
Germain, M., Gregor, K., Murray, I., Larochelle, H.: MADE: masked autoencoder for distribution estimation. In: International Conference on Machine Learning (ICML), pp. 881–889 (2015)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 2672–2680 (2014)
Haarnoja, T., Tang, H., Abbeel, P., Levine, S.: Reinforcement learning with deep energy-based policies. In: International Conference on Machine Learning (ICML), pp. 1352–1361 (2017)
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290 (2018)
Haskell, W.B., Kar, D., Fang, F., Tambe, M., Cheung, S., Denicola, E.: Robust protection of fisheries with COmPASS. In: AAAI Conference on Artificial Intelligence (2014)
Heinrich, J., Silver, D.: Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121 (2016)
Ho, J., Ermon, S.: Generative adversarial imitation learning. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 4565–4573 (2016)
Hoen, P.J., Tuyls, K., Panait, L., Luke, S., La Poutré, J.A.: An overview of cooperative and competitive multiagent learning. In: Tuyls, K., Hoen, P.J., Verbeeck, K., Sen, S. (eds.) LAMAS 2005. LNCS (LNAI), vol. 3898, pp. 1–46. Springer, Heidelberg (2006). https://doi.org/10.1007/11691839_1
Johnson, M.P., Fang, F., Tambe, M.: Designing patrol strategies to maximize pristine forest area. In: AAAI Conference on Artificial Intelligence (2012)
Kamra, N., Fang, F., Kar, D., Liu, Y., Tambe, M.: Handling continuous space security games with neural networks. In: IWAISe: International Workshop on Artificial Intelligence in Security, p. 17 (2017)
Kamra, N., Gupta, U., Fang, F., Liu, Y., Tambe, M.: Policy learning for continuous space security games using neural networks. In: AAAI Conference on Artificial Intelligence (2018)
Kamra, N., Gupta, U., Wang, K., Fang, F., Liu, Y., Tambe, M.: DeepFP for finding nash equilibrium in continuous action spaces. In: Alpcan, T., Vorobeychik, Y., Baras, J.S., Dán, G. (eds.) GameSec 2019. LNCS, vol. 11836, pp. 238–258. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32430-8_15
Kiekintveld, C., Jain, M., Tsai, J., et al.: Computing optimal randomized resource allocations for massive security games. In: International Conference on Autonomous Agents and Multi-agent Systems, pp. 689–696 (2009)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)
Kingma, D.P., Dhariwal, P.: Glow: generative flow with invertible 1×1 convolutions. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 10215–10224 (2018)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: International Conference on Learning Representations (ICLR) (2014)
Kochenderfer, M.J.: Decision Making Under Uncertainty: Theory and Application. MIT Press, Cambridge (2015)
Lanctot, M., Zambaldi, V., Gruslys, A., et al.: A unified game-theoretic approach to multiagent reinforcement learning. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 4190–4203 (2017)
Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
Littman, M.L.: Markov games as a framework for multi-agent reinforcement learning. In: Machine Learning Proceedings 1994, pp. 157–163. Elsevier (1994)
Liu, S., Lever, G., Merel, J., Tunyasuvunakool, S., Heess, N., Graepel, T.: Emergent coordination through competition. In: International Conference on Learning Representations (ICLR) (2018)
Muñoz-Garcia, F.: Advanced Microeconomic Theory: An Intuitive Approach with Examples. MIT Press, Cambridge (2017)
Omidshafiei, S., et al.: α-rank: multi-agent evaluation by evolution. Sci. Rep. 9(1), 1–29 (2019)
Papamakarios, G., Nalisnick, E., Rezende, D.J., Mohamed, S., Lakshminarayanan, B.: Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762 (2019)
Peyré, G., Cuturi, M., et al.: Computational optimal transport. Found. Trends® Mach. Learn. 11(5–6), 355–607 (2019)
Pomerleau, D.: Efficient training of artificial neural networks for autonomous navigation. Neural Comput. 3, 88–97 (1991)
Rezende, D., Mohamed, S.: Variational inference with normalizing flows. In: International Conference on Machine Learning (ICML), pp. 1530–1538 (2015)
Rhinehart, N., Kitani, K.M., Vernaza, P.: r2p2: a ReparameteRized pushforward policy for diverse, precise generative path forecasting. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 794–811. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_47
Rosenfeld, A., Kraus, S.: When security games hit traffic: optimal traffic enforcement under one sided uncertainty. In: International Joint Conferences on Artificial Intelligence (IJCAI), pp. 3814–3822 (2017)
Schmerling, E., Leung, K., Vollprecht, W., Pavone, M.: Multimodal probabilistic model-based planning for human-robot interaction. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9 (2017)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Shapley, L.S.: Stochastic games. Proc. Natl. Acad. Sci. 39(10), 1095–1100 (1953)
Silver, D., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484 (2016)
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.: Deterministic policy gradient algorithms. In: International Conference on Machine Learning (ICML), pp. 387–395 (2014)
Silver, D., Schrittwieser, J., Simonyan, K., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354 (2017)
Tambe, M.: Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned. Cambridge University Press, Cambridge (2011)
Tang, Y., Agrawal, S.: Implicit policy for reinforcement learning. arXiv preprint arXiv:1806.06798 (2018)
Taylor, P.D., Jonker, L.B.: Evolutionary stable strategies and game dynamics. Math. Biosci. 40(1–2), 145–156 (1978)
Vinyals, O., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019)
Wang, B., Zhang, Y., Zhou, Z.H., Zhong, S.: On repeated Stackelberg security game with the cooperative human behavior model for wildlife protection. Appl. Intell. 49, 1002–1015 (2017)
6 Appendix
Policy Implementation. Our implementation is based on the Garage [10] reinforcement learning library. For the Gaussian, Cholesky Gaussian, GMM, and MCG policies, we use a multi-layer perceptron (MLP) with 3 hidden layers of 64, 64, and 32 hidden units. The mean and covariance (and the weights for the mixture models) share the same MLP except for the last layer, which encourages knowledge sharing. The NFP1 policy uses the standard RealNVP structure with 5 coupling layers, with an additional state conditioning layer inserted after the first coupling layer; the state conditioning layer uses an MLP of 2 hidden layers with 64 hidden units [47]. The proposed conditional flow policy also uses 5 coupling layers, and each coupling layer has an MLP of 2 hidden layers with 32 hidden units. This MLP takes the concatenation of the observation and half of the latent variables (\(x_{1:d}\), \(d=\lfloor D/2 \rfloor\)) as input and outputs the scale and translation factors \(\alpha\) and \(t\) introduced in Sect. 3.1. The output \(\alpha\) is clipped to \([-5, 5]\) for numerical stability. For all policies, we additionally apply a tanh to the final outputs, similar to Haarnoja et al. [14], which limits the policy output space and bounds the entropy term in the loss. We keep the total number of parameters close to \(10^4\) across models for a fair comparison. All hidden layers use ReLU activations. For the farm security game, where the inputs are images, we prepend a convolutional neural network (CNN) with 2 convolution layers as a feature extractor for all models. The convolution layers have 32 and 16 channels, with filter sizes of \(16 \times 16\) and \(4 \times 4\) and strides of \(8 \times 8\) and \(2 \times 2\). This CNN structure follows Kamra et al. [20].
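For concreteness, below is a minimal PyTorch sketch of one observation-conditioned affine coupling layer matching the description above (an MLP of 2 hidden layers with 32 units, scale clipped to \([-5, 5]\)). The class and variable names are illustrative only; the paper's actual implementation builds on Garage and is not reproduced here.

import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """One RealNVP-style affine coupling layer conditioned on the observation (a sketch, not the authors' code)."""
    def __init__(self, action_dim, obs_dim, hidden=32):
        super().__init__()
        self.d = action_dim // 2  # d = floor(D/2): this half of the variables passes through unchanged
        # MLP takes [x_{1:d}, observation] and outputs scale alpha and translation t for the other half
        self.net = nn.Sequential(
            nn.Linear(self.d + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (action_dim - self.d)),
        )

    def forward(self, x, obs):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        alpha, t = self.net(torch.cat([x1, obs], dim=-1)).chunk(2, dim=-1)
        alpha = torch.clamp(alpha, -5.0, 5.0)   # clip the scale for numerical stability
        y2 = x2 * torch.exp(alpha) + t          # affine transform of the second half
        log_det = alpha.sum(dim=-1)             # log |det J| of this invertible map
        return torch.cat([x1, y2], dim=-1), log_det

A full policy would stack 5 such layers (swapping or permuting the two halves between layers so every dimension is transformed) and apply the final tanh squash, with the corresponding log-Jacobian correction, when computing log-likelihoods.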
Agent Modeling. For the experiments in Sect. 4.1, we train with behavior cloning, which maximizes the likelihood of the actions in the training data [36]. We use a batch size of 1024. The learning rate starts at 0.01 and decays by a factor of 0.8 every 1000 iterations. We train each policy for \(5 \times 10^{3}\) iterations on the synthetic dataset and \(2 \times 10^{4}\) iterations on the real-world dataset.
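A sketch of this training loop under the stated schedule follows; policy.log_prob and sample_batch are assumed interfaces, not part of Garage or the paper.

import torch

optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)
# decay the learning rate by a factor of 0.8 every 1000 iterations
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.8)

for it in range(num_iterations):  # 5e3 (synthetic) or 2e4 (real-world) iterations
    obs, act = sample_batch(dataset, batch_size=1024)  # expert state-action pairs
    loss = -policy.log_prob(act, obs).mean()           # maximize data likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()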
Multi-agent RL. We use proximal policy optimization (PPO) [41] as the policy optimization algorithm in Sect. 4.2. We add an entropy term to the loss function to encourage exploration; the entropy of each policy is estimated as the negative log-likelihood of one sampled action per state. The weight of the entropy loss starts at 1.0 and decays at a rate of 0.999 per iteration. All players are trained independently and simultaneously. We use the Adam optimizer [24] with a fixed learning rate of \(10^{-4}\). Training lasts \(10^4\) epochs, with a batch size of 512 for the repeated iterated games and 2048 for the farm security game.
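The entropy-regularized objective can be sketched as below; collect_rollouts, ppo_clip_loss, and sample_with_log_prob are hypothetical helpers standing in for the Garage PPO machinery.

import torch

entropy_coef = 1.0
for epoch in range(num_epochs):  # 1e4 epochs
    batch = collect_rollouts(env, policies, batch_size)    # on-policy samples
    for policy, optimizer in zip(policies, optimizers):    # players trained independently
        # single-sample entropy estimate: -log pi(a|s) for one sampled action per state
        actions, log_probs = policy.sample_with_log_prob(batch.obs)
        entropy_est = -log_probs.mean()
        loss = ppo_clip_loss(policy, batch) - entropy_coef * entropy_est
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    entropy_coef *= 0.999                                  # decay the entropy weight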