Abstract
Stochastic policy gradient methods using neural representations have had considerable success in single-agent domains with continuous action spaces. These methods typically use networks that output the parameters of a diagonal Gaussian distribution from which the resulting action is sampled. In multi-agent contexts, however, better policies may require complex multimodal action distributions. Building on recent progress in density modeling, we propose conditional normalizing flows as an alternative policy representation. This approach allows for more flexible action distributions than mixture models. We demonstrate their advantage over standard methods on a set of tasks including human behavior modeling and reinforcement learning in multi-agent settings.
References
Balduzzi, D., Tuyls, K., Perolat, J., Graepel, T.: Re-evaluating evaluation. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 3268–3279 (2018)
Bansal, T., Pachocki, J., Sidor, S., Sutskever, I., Mordatch, I.: Emergent complexity via multi-agent competition. In: International Conference on Learning Representations (ICLR) (2018)
Bhattacharyya, R.P., Phillips, D.J., Liu, C., Gupta, J.K., Driggs-Campbell, K., Kochenderfer, M.J.: Simulating emergent properties of human driving behavior using multi-agent reward augmented imitation learning. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 789–795. IEEE (2019)
Blum, A., Mansour, Y.: Learning, regret minimization, and equilibria. In: Nisan, N., Roughgarden, T., Tardos, E., Vazirani, V.V. (eds.) Algorithmic Game Theory, chap. 4, pp. 79–102. Cambridge University Press (2007)
Brown, G.W.: Iterative solution of games by fictitious play. In: Activity Analysis of Production and Allocation, vol. 13, no. 1, pp. 374–376 (1951)
Busoniu, L., Babuska, R., De Schutter, B.: A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 38(2), 156–172 (2008)
Cermák, J., Bošanský, B., Durkota, K., Lisý, V., Kiekintveld, C.: Using correlated strategies for computing Stackelberg equilibria in extensive-form games. In: AAAI Conference on Artificial Intelligence (2016)
Dinh, L., Krueger, D., Bengio, Y.: NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516 (2014)
Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using Real NVP. arXiv preprint arXiv:1605.08803 (2016)
Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning (ICML), pp. 1329–1338 (2016)
Germain, M., Gregor, K., Murray, I., Larochelle, H.: MADE: masked autoencoder for distribution estimation. In: International Conference on Machine Learning (ICML), pp. 881–889 (2015)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 2672–2680 (2014)
Haarnoja, T., Tang, H., Abbeel, P., Levine, S.: Reinforcement learning with deep energy-based policies. In: International Conference on Machine Learning (ICML), pp. 1352–1361 (2017)
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290 (2018)
Haskell, W.B., Kar, D., Fang, F., Tambe, M., Cheung, S., Denicola, E.: Robust protection of fisheries with COmPASS. In: AAAI Conference on Artificial Intelligence (2014)
Heinrich, J., Silver, D.: Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121 (2016)
Ho, J., Ermon, S.: Generative adversarial imitation learning. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 4565–4573 (2016)
Hoen, P.J., Tuyls, K., Panait, L., Luke, S., La Poutré, J.A.: An overview of cooperative and competitive multiagent learning. In: Tuyls, K., Hoen, P.J., Verbeeck, K., Sen, S. (eds.) LAMAS 2005. LNCS (LNAI), vol. 3898, pp. 1–46. Springer, Heidelberg (2006). https://doi.org/10.1007/11691839_1
Johnson, M.P., Fang, F., Tambe, M.: Designing patrol strategies to maximize pristine forest area. In: AAAI Conference on Artificial Intelligence (2012)
Kamra, N., Fang, F., Kar, D., Liu, Y., Tambe, M.: Handling continuous space security games with neural networks. In: IWAISe: International Workshop on Artificial Intelligence in Security, p. 17 (2017)
Kamra, N., Gupta, U., Fang, F., Liu, Y., Tambe, M.: Policy learning for continuous space security games using neural networks. In: AAAI Conference on Artificial Intelligence (2018)
Kamra, N., Gupta, U., Wang, K., Fang, F., Liu, Y., Tambe, M.: DeepFP for finding nash equilibrium in continuous action spaces. In: Alpcan, T., Vorobeychik, Y., Baras, J.S., Dán, G. (eds.) GameSec 2019. LNCS, vol. 11836, pp. 238–258. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32430-8_15
Kiekintveld, C., Jain, M., Tsai, J., et al.: Computing optimal randomized resource allocations for massive security games. In: International Conference on Autonomous Agents and Multi-agent Systems, pp. 689–696 (2009)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)
Kingma, D.P., Dhariwal, P.: Glow: generative flow with invertible 1×1 convolutions. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 10215–10224 (2018)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: International Conference on Learning Representations (ICLR) (2014)
Kochenderfer, M.J.: Decision Making Under Uncertainty: Theory and Application. MIT Press, Cambridge (2015)
Lanctot, M., Zambaldi, V., Gruslys, A., et al.: A unified game-theoretic approach to multiagent reinforcement learning. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 4190–4203 (2017)
Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
Littman, M.L.: Markov games as a framework for multi-agent reinforcement learning. In: Machine Learning Proceedings 1994, pp. 157–163. Elsevier (1994)
Liu, S., Lever, G., Merel, J., Tunyasuvunakool, S., Heess, N., Graepel, T.: Emergent coordination through competition. In: International Conference on Learning Representations (ICLR) (2018)
Muñoz-Garcia, F.: Advanced Microeconomic Theory: An Intuitive Approach with Examples. MIT Press, Cambridge (2017)
Omidshafiei, S., et al.: α-rank: multi-agent evaluation by evolution. Sci. Rep. 9(1), 1–29 (2019)
Papamakarios, G., Nalisnick, E., Rezende, D.J., Mohamed, S., Lakshminarayanan, B.: Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762 (2019)
Peyré, G., Cuturi, M., et al.: Computational optimal transport. Found. Trends® Mach. Learn. 11(5–6), 355–607 (2019)
Pomerleau, D.: Efficient training of artificial neural networks for autonomous navigation. Neural Comput. 3, 88–97 (1991)
Rezende, D., Mohamed, S.: Variational inference with normalizing flows. In: International Conference on Machine Learning (ICML), pp. 1530–1538 (2015)
Rhinehart, N., Kitani, K.M., Vernaza, P.: r2p2: a ReparameteRized pushforward policy for diverse, precise generative path forecasting. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 794–811. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_47
Rosenfeld, A., Kraus, S.: When security games hit traffic: optimal traffic enforcement under one sided uncertainty. In: International Joint Conferences on Artificial Intelligence (IJCAI), pp. 3814–3822 (2017)
Schmerling, E., Leung, K., Vollprecht, W., Pavone, M.: Multimodal probabilistic model-based planning for human-robot interaction. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9 (2017)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Shapley, L.S.: Stochastic games. Proc. Natl. Acad. Sci. 39(10), 1095–1100 (1953)
Silver, D., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484 (2016)
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.: Deterministic policy gradient algorithms. In: International Conference on Machine Learning (ICML), pp. 387–395 (2014)
Silver, D., Schrittwieser, J., Simonyan, K., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354 (2017)
Tambe, M.: Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned. Cambridge University Press, Cambridge (2011)
Tang, Y., Agrawal, S.: Implicit policy for reinforcement learning. arXiv preprint arXiv:1806.06798 (2018)
Taylor, P.D., Jonker, L.B.: Evolutionary stable strategies and game dynamics. Math. Biosci. 40(1–2), 145–156 (1978)
Vinyals, O., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019)
Wang, B., Zhang, Y., Zhou, Z.H., Zhong, S.: On repeated Stackelberg security game with the cooperative human behavior model for wildlife protection. Appl. Intell. 49, 1002–1015 (2017)
6 Appendix
Policy Implementation. Our implementation is based on the Garage [10] reinforcement learning library. For the Gaussian, Cholesky Gaussian, GMM, and MCG policies, we use a multi-layer perceptron (MLP) with 3 hidden layers of 64, 64, and 32 hidden units. The mean and covariance (and the weights for the mixture models) share the same MLP except for the last layer, which encourages knowledge sharing. The NFP1 policy uses the standard RealNVP structure with 5 coupling layers, with an additional state conditioning layer inserted after the first coupling layer; the state conditioning layer uses an MLP of 2 hidden layers with 64 hidden units [47]. The proposed conditional flow policy also uses 5 coupling layers, and each coupling layer has an MLP of 2 hidden layers with 32 hidden units. This MLP takes the concatenation of the observation and half of the latent variables (\(x_{1:d}\), \(d=\lfloor D/2 \rfloor\)) as input and outputs the scale and translation factors \(\alpha\) and \(t\) introduced in Sect. 3.1. The output \(\alpha\) is clipped to \([-5, 5]\) for numerical stability. For all policies, we additionally apply a tanh to the final outputs, similar to Haarnoja et al. [14], which limits the policy output space and bounds the entropy term in the loss. We keep the total number of parameters close to \(10^4\) across models for a fair comparison. All hidden layers use ReLU activations. For the farm security game, where the inputs are images, we prepend a convolutional neural network (CNN) with 2 convolution layers as a feature extractor for all models. The convolution layers have 32 and 16 channels, with filter sizes of \(16 \times 16\) and \(4 \times 4\) and strides of \(8 \times 8\) and \(2 \times 2\). This CNN structure follows Kamra et al. [20].
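For concreteness, below is a minimal PyTorch sketch of one observation-conditioned affine coupling layer matching the description above (an MLP of 2 hidden layers with 32 units, scale clipped to \([-5, 5]\)). The class and variable names are illustrative only; the paper's actual implementation builds on Garage and is not reproduced here.

import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """One RealNVP-style affine coupling layer conditioned on the observation (a sketch, not the authors' code)."""
    def __init__(self, action_dim, obs_dim, hidden=32):
        super().__init__()
        self.d = action_dim // 2  # d = floor(D/2): this half of the variables passes through unchanged
        # MLP takes [x_{1:d}, observation] and outputs scale alpha and translation t for the other half
        self.net = nn.Sequential(
            nn.Linear(self.d + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (action_dim - self.d)),
        )

    def forward(self, x, obs):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        alpha, t = self.net(torch.cat([x1, obs], dim=-1)).chunk(2, dim=-1)
        alpha = torch.clamp(alpha, -5.0, 5.0)   # clip the scale for numerical stability
        y2 = x2 * torch.exp(alpha) + t          # affine transform of the second half
        log_det = alpha.sum(dim=-1)             # log |det J| of this invertible map
        return torch.cat([x1, y2], dim=-1), log_det

A full policy would stack 5 such layers (swapping or permuting the two halves between layers so every dimension is transformed) and apply the final tanh squash, with the corresponding log-Jacobian correction, when computing log-likelihoods.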
Agent Modeling. For the experiments in Sect. 4.1, we train with behavior cloning, which maximizes the likelihood of the actions in the training data [36]. We use a batch size of 1024. The learning rate starts at 0.01 and decays by a factor of 0.8 every 1000 iterations. We train each policy for \(5 \times 10^{3}\) iterations on the synthetic dataset and \(2 \times 10^{4}\) iterations on the real-world dataset.
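A sketch of this training loop under the stated schedule follows; policy.log_prob and sample_batch are assumed interfaces, not part of Garage or the paper.

import torch

optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)
# decay the learning rate by a factor of 0.8 every 1000 iterations
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.8)

for it in range(num_iterations):  # 5e3 (synthetic) or 2e4 (real-world) iterations
    obs, act = sample_batch(dataset, batch_size=1024)  # expert state-action pairs
    loss = -policy.log_prob(act, obs).mean()           # maximize data likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()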
Multi-agent RL. We use proximal policy optimization (PPO) [41] as the policy optimization algorithm in Sect. 4.2. We add an entropy term to the loss function to encourage exploration; the entropy of each policy is estimated as the negative log-likelihood of one sampled action per state. The weight of the entropy loss starts at 1.0 and decays at a rate of 0.999 per iteration. All players are trained independently and simultaneously. We use the Adam optimizer [24] with a fixed learning rate of \(10^{-4}\). Training lasts \(10^4\) epochs, with a batch size of 512 for the repeated iterated games and 2048 for the farm security game.
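The entropy-regularized objective can be sketched as below; collect_rollouts, ppo_clip_loss, and sample_with_log_prob are hypothetical helpers standing in for the Garage PPO machinery.

import torch

entropy_coef = 1.0
for epoch in range(num_epochs):  # 1e4 epochs
    batch = collect_rollouts(env, policies, batch_size)    # on-policy samples
    for policy, optimizer in zip(policies, optimizers):    # players trained independently
        # single-sample entropy estimate: -log pi(a|s) for one sampled action per state
        actions, log_probs = policy.sample_with_log_prob(batch.obs)
        entropy_est = -log_probs.mean()
        loss = ppo_clip_loss(policy, batch) - entropy_coef * entropy_est
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    entropy_coef *= 0.999                                  # decay the entropy weight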