Normalizing Flow Policies for Multi-agent Systems

  • Conference paper
Decision and Game Theory for Security (GameSec 2020)

Abstract

Stochastic policy gradient methods using neural representations have had considerable success in single-agent domains with continuous action spaces. These methods typically use networks that output the parameters of a diagonal Gaussian distribution from which the resulting action is sampled. In multi-agent contexts, however, better policies may require complex multimodal action distributions. Based on recent progress in density modeling, we propose an alternative for policy representation in the form of conditional normalizing flows. This approach allows for greater flexibility in action distribution representation beyond mixture models. We demonstrate their advantage over standard methods on a set of tasks including human behavior modeling and reinforcement learning in multi-agent settings.


Notes

  1. https://github.com/StanfordASL/TrafficWeavingCVAE.

References

  1. Balduzzi, D., Tuyls, K., Perolat, J., Graepel, T.: Re-evaluating evaluation. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 3268–3279 (2018)

  2. Bansal, T., Pachocki, J., Sidor, S., Sutskever, I., Mordatch, I.: Emergent complexity via multi-agent competition. In: International Conference on Learning Representations (ICLR) (2018)

  3. Bhattacharyya, R.P., Phillips, D.J., Liu, C., Gupta, J.K., Driggs-Campbell, K., Kochenderfer, M.J.: Simulating emergent properties of human driving behavior using multi-agent reward augmented imitation learning. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 789–795 (2019)

  4. Blum, A., Mansour, Y.: Learning, regret minimization, and equilibria. In: Nisan, N., Roughgarden, T., Tardos, E., Vazirani, V.V. (eds.) Algorithmic Game Theory, chap. 4, pp. 79–102. Cambridge University Press (2007)

  5. Brown, G.W.: Iterative solution of games by fictitious play. In: Activity Analysis of Production and Allocation, vol. 13, no. 1, pp. 374–376 (1951)

  6. Busoniu, L., Babuska, R., De Schutter, B.: A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 38(2), 156–172 (2008)

  7. Cermák, J., Bošanský, B., Durkota, K., Lisý, V., Kiekintveld, C.: Using correlated strategies for computing Stackelberg equilibria in extensive-form games. In: AAAI Conference on Artificial Intelligence (2016)

  8. Dinh, L., Krueger, D., Bengio, Y.: NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516 (2014)

  9. Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using Real NVP. arXiv preprint arXiv:1605.08803 (2016)

  10. Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning (ICML), pp. 1329–1338 (2016)

  11. Germain, M., Gregor, K., Murray, I., Larochelle, H.: MADE: masked autoencoder for distribution estimation. In: International Conference on Machine Learning (ICML), pp. 881–889 (2015)

  12. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 2672–2680 (2014)

  13. Haarnoja, T., Tang, H., Abbeel, P., Levine, S.: Reinforcement learning with deep energy-based policies. In: International Conference on Machine Learning (ICML), pp. 1352–1361 (2017)

  14. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290 (2018)

  15. Haskell, W.B., Kar, D., Fang, F., Tambe, M., Cheung, S., Denicola, E.: Robust protection of fisheries with COmPASS. In: AAAI Conference on Artificial Intelligence (2014)

  16. Heinrich, J., Silver, D.: Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121 (2016)

  17. Ho, J., Ermon, S.: Generative adversarial imitation learning. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 4565–4573 (2016)

  18. Hoen, P.J., Tuyls, K., Panait, L., Luke, S., La Poutré, J.A.: An overview of cooperative and competitive multiagent learning. In: Tuyls, K., Hoen, P.J., Verbeeck, K., Sen, S. (eds.) LAMAS 2005. LNCS (LNAI), vol. 3898, pp. 1–46. Springer, Heidelberg (2006). https://doi.org/10.1007/11691839_1

  19. Johnson, M.P., Fang, F., Tambe, M.: Designing patrol strategies to maximize pristine forest area. In: AAAI Conference on Artificial Intelligence (2012)

  20. Kamra, N., Fang, F., Kar, D., Liu, Y., Tambe, M.: Handling continuous space security games with neural networks. In: IWAISe: International Workshop on Artificial Intelligence in Security, p. 17 (2017)

  21. Kamra, N., Gupta, U., Fang, F., Liu, Y., Tambe, M.: Policy learning for continuous space security games using neural networks. In: AAAI Conference on Artificial Intelligence (2018)

  22. Kamra, N., Gupta, U., Wang, K., Fang, F., Liu, Y., Tambe, M.: DeepFP for finding Nash equilibrium in continuous action spaces. In: Alpcan, T., Vorobeychik, Y., Baras, J.S., Dán, G. (eds.) GameSec 2019. LNCS, vol. 11836, pp. 238–258. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32430-8_15

  23. Kiekintveld, C., Jain, M., Tsai, J., et al.: Computing optimal randomized resource allocations for massive security games. In: International Conference on Autonomous Agents and Multi-agent Systems, pp. 689–696 (2009)

  24. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)

  25. Kingma, D.P., Dhariwal, P.: Glow: generative flow with invertible \(1\times 1\) convolutions. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 10215–10224 (2018)

  26. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: International Conference on Learning Representations (ICLR) (2013)

  27. Kochenderfer, M.J.: Decision Making Under Uncertainty: Theory and Application. MIT Press, Cambridge (2015)

  28. Lanctot, M., Zambaldi, V., Gruslys, A., et al.: A unified game-theoretic approach to multiagent reinforcement learning. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 4190–4203 (2017)

  29. Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)

  30. Littman, M.L.: Markov games as a framework for multi-agent reinforcement learning. In: Machine Learning, pp. 157–163. Elsevier (1994)

  31. Liu, S., Lever, G., Merel, J., Tunyasuvunakool, S., Heess, N., Graepel, T.: Emergent coordination through competition. In: International Conference on Learning Representations (ICLR) (2018)

  32. Muñoz-Garcia, F.: Advanced Microeconomic Theory: An Intuitive Approach with Examples. MIT Press, Cambridge (2017)

  33. Omidshafiei, S., et al.: \(\alpha \)-rank: multi-agent evaluation by evolution. Sci. Rep. 9(1), 1–29 (2019)

  34. Papamakarios, G., Nalisnick, E., Rezende, D.J., Mohamed, S., Lakshminarayanan, B.: Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762 (2019)

  35. Peyré, G., Cuturi, M., et al.: Computational optimal transport. Found. Trends Mach. Learn. 11(5–6), 355–607 (2019)

  36. Pomerleau, D.: Efficient training of artificial neural networks for autonomous navigation. Neural Comput. 3, 88–97 (1991)

  37. Rezende, D., Mohamed, S.: Variational inference with normalizing flows. In: International Conference on Machine Learning (ICML), pp. 1530–1538 (2015)

  38. Rhinehart, N., Kitani, K.M., Vernaza, P.: r2p2: a ReparameteRized pushforward policy for diverse, precise generative path forecasting. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 794–811. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_47

  39. Rosenfeld, A., Kraus, S.: When security games hit traffic: optimal traffic enforcement under one-sided uncertainty. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 3814–3822 (2017)

  40. Schmerling, E., Leung, K., Vollprecht, W., Pavone, M.: Multimodal probabilistic model-based planning for human-robot interaction. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9 (2017)

  41. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  42. Shapley, L.S.: Stochastic games. Proc. Natl. Acad. Sci. 39(10), 1095–1100 (1953)

  43. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484 (2016)

  44. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.: Deterministic policy gradient algorithms. In: International Conference on Machine Learning (ICML), pp. 387–395 (2014)

  45. Silver, D., Schrittwieser, J., Simonyan, K., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354 (2017)

  46. Tambe, M.: Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned. Cambridge University Press, Cambridge (2011)

  47. Tang, Y., Agrawal, S.: Implicit policy for reinforcement learning. arXiv preprint arXiv:1806.06798 (2018)

  48. Taylor, P.D., Jonker, L.B.: Evolutionary stable strategies and game dynamics. Math. Biosci. 40(1–2), 145–156 (1978)

  49. Vinyals, O., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019)

  50. Wang, B., Zhang, Y., Zhou, Z.H., Zhong, S.: On repeated Stackelberg security game with the cooperative human behavior model for wildlife protection. Appl. Intell. 49, 1002–1015 (2017)

Author information

Correspondence to Xiaobai Ma.


6 Appendix

Policy Implementation. Our implementation is based on the Garage [10] reinforcement learning library. We use a multi-layer perceptron (MLP) with 3 hidden layers of 64, 64, and 32 hidden units for the Gaussian, Cholesky Gaussian, GMM, and MCG policies. The mean and covariance (and the weights for the mixture models) share the same MLP except for the last layer, for better knowledge sharing. The NFP1 policy uses the standard RealNVP structure with 5 coupling layers; an additional state conditioning layer, consisting of an MLP of 2 hidden layers with 64 hidden units, is added after the first coupling layer [47]. For the proposed conditional flow policy, we use 5 coupling layers, and each coupling layer has an MLP of 2 hidden layers with 32 hidden units. The MLP takes the concatenation of the observation and half of the latent variables (\(x_{1:d}\), \(d=\lfloor D/2 \rfloor \)) as input and outputs the scale and translation factors \(\alpha \) and \(t\) introduced in Sect. 3.1. The output \(\alpha \) is clipped to \([-5,5]\) for better numerical stability. Similar to Haarnoja et al. [14], we additionally apply a tanh to the final outputs of all policies, which limits the policy output space and bounds the entropy term in the loss. We keep the total number of parameters close to \(10^4\) for all models to ensure a fair comparison. All hidden layers use ReLU activations. For the farm security game, since the inputs are images, we additionally add a convolutional neural network (CNN) with 2 convolution layers as a feature extractor for all models. The convolution layers have 32 and 16 channels, filter sizes of \(16\times 16\) and \(4 \times 4\), and strides of \(8 \times 8\) and \(2 \times 2\). This CNN structure follows Kamra et al. [20].
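For concreteness, the following is a minimal sketch of one state-conditioned affine coupling layer of the kind described above. It is written in plain PyTorch rather than on top of Garage, and the class and argument names (ConditionalAffineCoupling, action_dim, obs_dim) are illustrative assumptions, not names from our implementation.

    import torch
    import torch.nn as nn

    class ConditionalAffineCoupling(nn.Module):
        """One state-conditioned RealNVP coupling layer (illustrative sketch)."""

        def __init__(self, action_dim, obs_dim, hidden=32):
            super().__init__()
            self.d = action_dim // 2                      # d = floor(D / 2)
            out_dim = action_dim - self.d
            # MLP with 2 hidden layers of 32 ReLU units, as stated in the text above.
            self.net = nn.Sequential(
                nn.Linear(obs_dim + self.d, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 2 * out_dim),
            )

        def forward(self, x, obs):
            x1, x2 = x[:, :self.d], x[:, self.d:]
            # Scale and translation computed from the observation and x_{1:d}.
            alpha, t = self.net(torch.cat([obs, x1], dim=-1)).chunk(2, dim=-1)
            alpha = torch.clamp(alpha, -5.0, 5.0)         # clip for numerical stability
            y2 = x2 * torch.exp(alpha) + t                # affine transform of the second half
            log_det = alpha.sum(dim=-1)                   # log |det Jacobian| of this layer
            return torch.cat([x1, y2], dim=-1), log_det

Stacking 5 such layers (with the transformed halves alternating between layers) and applying the final tanh squashing gives a conditional flow policy of the shape described above; the tanh contributes an additional log-determinant term not shown in this sketch.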

Agent Modeling. As our training algorithm in Sect. 4.1, we use behavior cloning, which maximizes the likelihood of the actions in the training data [36]. We use a batch size of 1024. The learning rate starts at 0.01 and decays by a factor of 0.8 every 1000 iterations. We train each policy for \(5 \times 10^{3}\) iterations on the synthetic dataset and \(2 \times 10^{4}\) iterations on the real-world dataset.
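As a rough illustration of this schedule, the snippet below pairs a maximum-likelihood behavior cloning loss with a step decay of the learning rate. The optimizer choice, the demo_batches iterable, and the policy.log_prob(actions, observations) interface are assumptions made for the sketch, not details fixed by the text above.

    import torch

    # `policy` is any policy exposing log_prob(actions, observations),
    # e.g. a stack of the coupling layers sketched above (hypothetical interface).
    optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)   # Adam is an assumption here
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.8)

    for iteration, (obs, act) in enumerate(demo_batches):        # batches of 1024 state-action pairs
        loss = -policy.log_prob(act, obs).mean()                 # maximize likelihood of the demonstrations
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                                         # decay lr by 0.8 every 1000 iterations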

Multi-agent RL. We use proximal policy optimization (PPO) [41] as our policy optimization algorithm in Sect. 4.2. We add an extra entropy term to the loss function for better exploration. The entropy of each policy is estimated using the negative log-likelihood of one sampled action per state. The weight of the entropy loss starts at 1.0 and decays by a factor of 0.999 per iteration. All players are trained independently and simultaneously. We use the Adam optimizer [24] with a fixed learning rate of \(10^{-4}\). Training lasts \(10^4\) epochs with a batch size of 512 for the repeated iterated games and 2048 for the farm security game.
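A minimal sketch of the resulting per-iteration objective is shown below. The clipped surrogate follows Schulman et al. [41], the entropy is approximated by the negative log-likelihood of the sampled action as described above, and the function and argument names are illustrative rather than taken from our implementation.

    import torch

    def ppo_loss(policy, obs, act, old_log_prob, adv, entropy_weight, clip_eps=0.2):
        """Clipped PPO surrogate with a decaying entropy bonus (sketch)."""
        log_prob = policy.log_prob(act, obs)                 # hypothetical policy interface
        ratio = torch.exp(log_prob - old_log_prob)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        surrogate = torch.min(ratio * adv, clipped * adv).mean()
        entropy_est = -log_prob.mean()                       # one-sample entropy estimate per state
        return -(surrogate + entropy_weight * entropy_est)   # minimize the negative objective

    # Entropy weight schedule: starts at 1.0 and decays by a factor of 0.999 each iteration,
    # i.e. entropy_weight = 0.999 ** iteration.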

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Ma, X., Gupta, J.K., Kochenderfer, M.J. (2020). Normalizing Flow Policies for Multi-agent Systems. In: Zhu, Q., Baras, J.S., Poovendran, R., Chen, J. (eds.) Decision and Game Theory for Security. GameSec 2020. Lecture Notes in Computer Science, vol. 12513. Springer, Cham. https://doi.org/10.1007/978-3-030-64793-3_15

  • DOI: https://doi.org/10.1007/978-3-030-64793-3_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-64792-6

  • Online ISBN: 978-3-030-64793-3

  • eBook Packages: Computer Science (R0)