Abstract
We introduce the Laser Learning Environment (LLE), a collaborative multi-agent reinforcement learning environment where coordination is key. In LLE, agents depend on each other to make progress (interdependence), must jointly take specific sequences of actions to succeed (perfect coordination), and accomplishing those joint actions does not yield any intermediate reward (zero-incentive dynamics). The challenge of such problems lies in the difficulty of escaping state space bottlenecks caused by interdependence steps, since escaping those bottlenecks is not rewarded. We test multiple state-of-the-art value-based MARL algorithms against LLE and show that they consistently fail at the collaborative task because of their inability to escape state space bottlenecks, even though they successfully achieve perfect coordination. We show that Q-learning extensions such as prioritised experience replay and n-step returns hinder exploration in environments with zero-incentive dynamics, and find that intrinsic curiosity with random network distillation is not sufficient to escape those bottlenecks. We demonstrate the need for novel methods to solve this problem and the relevance of LLE as a cooperative MARL benchmark.
Notes
1. The code is available at https://github.com/yamoling/bnaic-2023-lle.
Acknowledgements
Raphaël Avalos is supported by the FWO (Research Foundation – Flanders) under grant 11F5721N. Tom Lenaerts is supported by an FWO project (grant number G054919N) and two FRS-FNRS PDR projects (grant numbers 31257234 and 40007793). He is furthermore supported by the Service Public de Wallonie Recherche under grant no. 2010235-ariac by digitalwallonia4.ai. Ann Nowé and Tom Lenaerts are also supported by the Flemish Government through the AI Research Program and by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No. 952215.
Appendices
A Hyperparameters
A hyperparameter search was performed with VDN over a combination of batch sizes (32, 64 and 128 transitions), memory sizes (50k, 100k and 200k transitions) and training intervals (1 and 5). We then performed a hyperparameter search for prioritised experience replay over a combination of \(\alpha \) (0.3, 0.4, 0.5, 0.6, 0.7, 0.8) and \(\beta \) (0.3, 0.4, 0.5, 0.6, 0.7, 0.8) values. For random network distillation, we explored update ratios \(p \in \{0, 0.25, 0.5, 0.75\}\).
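For reference, these grids amount to a plain cartesian-product sweep. The sketch below (Python, with illustrative variable names that are not taken from our code base) enumerates the combinations listed above.

```python
from itertools import product

# Hyperparameter grids described above (identifier names are illustrative,
# not the exact ones used in our code base).
vdn_grid = {
    "batch_size": [32, 64, 128],                  # transitions per batch
    "memory_size": [50_000, 100_000, 200_000],    # replay buffer capacity
    "train_interval": [1, 5],                     # steps between updates
}
per_grid = {
    "alpha": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],      # prioritisation exponent
    "beta": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],       # importance-sampling exponent
}
rnd_update_ratios = [0.0, 0.25, 0.5, 0.75]        # RND predictor update ratio p


def combinations(grid: dict) -> list[dict]:
    """Enumerate every configuration in the cartesian product of the grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


for config in combinations(vdn_grid):
    ...  # train VDN with this configuration and record score and exit rate
```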
B Neural Network Architectures
B.1 Q-Network
The Q-network consists of two parts joined by a flattening step: a convolutional neural network of three layers, whose output is flattened and fed into a network of three linear layers. The architecture is detailed in Table 3.
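As an illustration, a PyTorch-style sketch of this kind of architecture is given below. The layer sizes (channel counts, kernel sizes, hidden width) are placeholders; the exact values are those listed in Table 3.

```python
import torch
from torch import nn


class QNetwork(nn.Module):
    """Three convolutional layers, a flattening step, then three linear layers.

    All sizes below are placeholders; the exact values are given in Table 3.
    """

    def __init__(self, in_channels: int, n_actions: int):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
        )
        self.mlp = nn.Sequential(
            nn.LazyLinear(128), nn.ReLU(),  # infers the flattened CNN output size
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        features = self.cnn(obs)
        return self.mlp(features.flatten(start_dim=1))
```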
B.2 Random Network Distillation
The random network used to compute the intrinsic reward is a convolutional network similar to the Q-network and is depicted in Table 4. The frozen random network (the target) consists of the first part of the table, with an output of size 512. The optimised network (the predictor) has an additional tail with one ReLU activation and one linear layer.
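The sketch below illustrates this setup (PyTorch-style, with the same placeholder sizes as above, not our exact implementation): a frozen random target producing a 512-dimensional embedding, a predictor with an extra ReLU and linear tail, and the prediction error used as the intrinsic reward.

```python
import torch
from torch import nn


def make_trunk(in_channels: int, out_dim: int = 512) -> nn.Sequential:
    """Convolutional trunk ending in an embedding of size out_dim (placeholder sizes)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
        nn.Flatten(),
        nn.LazyLinear(out_dim),
    )


class RND(nn.Module):
    """Frozen random target network and trained predictor network."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.target = make_trunk(in_channels)     # frozen, never optimised
        self.predictor = nn.Sequential(           # same trunk + ReLU + linear tail
            make_trunk(in_channels), nn.ReLU(), nn.Linear(512, 512),
        )
        for param in self.target.parameters():
            param.requires_grad = False

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        """Prediction error of the predictor, used as the intrinsic reward."""
        with torch.no_grad():
            target_features = self.target(obs)
        return (self.predictor(obs) - target_features).pow(2).mean(dim=-1)
```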
C Results of n-Step Returns with VDN
We plot in Fig. 6 the score and exit rate over the course of training on level 6, illustrated in Fig. 1. The agents are trained with VDN and n-step returns, using the hyperparameters shown in Table 2. These results show that higher values of n yield worse results, as discussed in Sect. 5.3.
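As a reminder of what varying n changes, the n-step return target has the standard textbook form below (a minimal sketch, not our exact implementation).

```python
def n_step_target(rewards: list[float], bootstrap_value: float, gamma: float, n: int) -> float:
    """Discounted sum of the next n rewards plus a bootstrapped value n steps ahead."""
    horizon = min(n, len(rewards))
    target = sum(gamma ** k * rewards[k] for k in range(horizon))
    if len(rewards) >= n:  # only bootstrap when the episode did not end within n steps
        target += gamma ** n * bootstrap_value
    return target
```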
D Maps Provided by LLE
LLE comes with six predefined levels illustrated in Fig. 7.